Kubernetes学习笔记
- 1: 云原生体系
- 2: 安装与配置
- 2.1: 部署k8s集群
- 2.1.1: 使用kubeadm安装生产环境kubernetes
- 2.1.2: 使用kubespray安装kubernetes
- 2.1.3: 使用minikube安装kubernetes
- 2.1.4: 使用kind安装kubernetes
- 2.2: k8s证书及秘钥
- 2.3: k8s版本说明
- 3: 基本概念
- 3.1: kubernetes架构
- 3.1.1: Kubernetes总架构图
- 3.1.2: 基于Docker及Kubernetes技术构建容器云(PaaS)平台
- 3.2: kubernetes对象
- 3.2.1: Kubernetes基本概念
- 3.2.2: 理解kubernetes对象
- 3.3: Pod对象
- 3.3.1: Pod介绍
- 3.3.2: Pod定义文件
- 3.3.3: Pod生命周期
- 3.3.4: Pod健康检查
- 3.3.5: Pod存储卷
- 3.3.6: Pod调度
- 3.3.7: Pod伸缩与升级
- 3.4: 配置
- 3.4.1: ConfigMap
- 4: 核心原理
- 4.1: 核心组件
- 4.1.1: Kubernetes核心原理(一)之API Server
- 4.1.2: Kubernetes核心原理(二)之Controller Manager
- 4.1.3: Kubernetes核心原理(三)之Scheduler
- 4.1.4: Kubernetes核心原理(四)之kubelet
- 4.2: 流程图
- 5: 容器网络
- 6: 容器存储
- 6.1: 存储卷概念
- 6.2: CSI
- 6.2.1: csi-cephfs-plugin
- 6.2.2: 部署csi-cephfs
- 6.2.3: 部署cephfs-provisioner
- 6.2.4:
- 7: 资源隔离
- 7.1:
- 7.2:
- 7.3:
- 7.4: Lxcfs资源视图隔离
- 8: 运维指南
- 8.1: Kubernetes集群问题排查
- 8.2: kubectl工具
- 8.2.1: kubectl安装与配置
- 8.2.2: kubectl命令说明
- 8.2.3: kubectl命令别名
- 8.3: 节点调度
- 8.4: 镜像仓库
- 9: 开发指南
- 9.1: client-go的使用及源码分析
- 9.2: operator开发
- 9.2.1: kubebuilder的使用
- 9.3: CSI插件开发
- 9.3.1: csi-provisioner源码分析
- 9.3.2: nfs-client-provisioner源码分析
- 10: 问题排查
- 10.1: 节点问题
- 10.1.1: keycreate permission denied
- 10.1.2: Cgroup不支持pid资源
- 10.1.3: Cgroup子系统无法挂载
- 10.2: Pod驱逐
- 10.3: 镜像拉取失败问题
- 10.4: PVC Terminating
- 11: 源码分析
- 12: Runtime
- 12.1: Runc和Containerd概述
- 12.2: Containerd
- 12.2.1: 安装Containerd
- 12.2.2:
- 12.2.3:
- 12.3: Docker
- 12.4: Kata Container
- 12.5: GPU
- 12.5.1: nvidia-device-plugin介绍
- 13: Etcd集群
- 13.1: Etcd介绍
- 13.2: Raft算法
- 13.3: etcdctl命令工具
- 13.3.1: etcdctl-V3
- 13.3.2: etcdctl-V2
- 13.4: Etcd访问控制
- 13.5: Etcd启动配置参数
- 13.6: Etcd中的k8s数据
- 13.7: etcd-operator的使用
- 14: 多集群管理
- 14.1: k8s多集群管理的思考
- 14.2: Virtual Kubelet
- 14.2.1: Virtual Kubelet介绍
- 14.2.2: Virtual Kubelet命令
- 14.3: Karmada
- 14.3.1: Karmada介绍
- 15: 边缘容器
- 15.1: KubeEdge
- 15.1.1: KubeEdge介绍
- 15.1.2: KubeEdge源码分析
- 15.2: OpenYurt
- 15.2.1: OpenYurt部署
- 16: 虚拟化
- 16.1: 虚拟化相关概念
- 16.2: KubeVirt
- 16.2.1: KubeVirt的介绍
- 16.2.2: KubeVirt的使用
- 17: 监控体系
- 17.1: Kubernetes集群监控
- 17.2: cAdvisor介绍
- 17.3: Heapster介绍
- 17.4: Influxdb介绍
1 - 云原生体系
1.1 - k8s知识体系
1. k8s知识体系
以下整理了k8s涉及的相关知识体系。

思维导图:https://www.processon.com/view/link/5d7f7d08e4b03461a3a937e2
2. k8s重点开源项目
TODO
1.2 - 12 Factor
以下主要介绍PaaS平台设计架构中使用到的方法论,统称为12-Factor(要素)。
简介
软件通常会作为一种服务来交付,即软件即服务(SaaS)。12-Factor原则为构建SaaS应用提供了以下的方法论:
- 使用标准化流程自动配置,减少开发者的学习成本。
- 和操作系统解耦,使其可以在各个系统间提供最大的移植性。
- 适合部署在现代的云计算平台上,从而在服务器和系统管理方面节省资源。
- 将开发环境与生产环境的差异降至最低,并使用持续交付实施敏捷开发。
- 可以在工具、架构和开发流程不发生明显变化的前提下实现拓展
该理论适应于任何语言和后端服务(数据库、消息队列、缓存等)开发的应用程序。
1. 基准代码
一份基准代码,多份部署
应用代码使用版本控制系统来管理,常用的有Git、SVN等。一份用来跟踪代码所有修订版本的数据库称为代码库。
1.1. 一份基准代码
基准代码和应用之间总是保持一一对应的关系:
- 一旦有多个基准代码,则不能称之为一个应用,而是一个分布式系统。分布式系统中的每个组件都是一个应用,每个应用都可以使用12-Factor原则进行开发。
- 多个应用共享一份基准代码有悖于12-Factor原则。解决方法是将共享的代码拆成独立的类库,通过依赖管理去使用它们。
1.2. 多份部署
每个应用只对应一份基准代码,但可以同时存在多份的部署,每份部署相当于运行了一个应用的实例。
多份部署的区别在于:
- 可以存在不同的配置文件对应不同的环境。例如开发环境、预发布环境、生产环境等。
- 可以使用不同的代码版本。例如开发环境的版本可能高于预发布环境(还未同步到预发布环境),同理,预发布环境版本可能高于生产环境版本。
2. 依赖
显式声明依赖关系
大多数的编程语言都会提供一个包管理系统或工具,其中包含所有的依赖库,例如Golang的vendor目录存放了该应用的所有依赖包。
12-Factor原则下的应用会通过依赖清单来显式确切地声明所有的依赖项。在运行过程中通过依赖隔离工具来保证应用不会去调用系统中存在但依赖清单中未声明的依赖项。
显式声明依赖项的优点在于可以简化环境配置流程,开发者关注应用的基准代码,而依赖库则由依赖库管理工具来管理和配置。例如,Golang中的包管理工具dep等。
3. 配置
在环境中存储配置
通常,应用的配置在不同的发布环境中(例如:开发、预发布、生产环境)会有很大的差异,其中包括:
- 数据库、Redis等后端服务的配置
- 每份部署特有的配置,例如域名
- 第三方服务的证书等
12-Factor原则要求代码和配置严格分离,配置不应该以常量的形式写在代码里面。配置在不同的部署环境中存在大幅差异,但是代码却是完全一致的。
判断一个应用是否正确地将配置排除在代码外,可以看应用的基准代码是否可以立即开源而不担心暴露敏感信息。
12-Factor原则建议将应用的配置存储在环境变量中,环境变量可以方便在不同的部署环境中修改,而不侵入原有的代码。(例如,k8s的大部分代码配置是通过环境变量的方式来传入的)。
12-Factor应用中,环境变量的粒度要足够小且相对独立。当应用需要拓展时,可以平滑过渡。
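下面补充一个示意性的例子(其中 demo-app、demo-config 等名称均为假设),展示在k8s中通过环境变量向容器注入配置、实现代码与配置分离的常见做法:
apiVersion: v1
kind: Pod
metadata:
  name: demo-app                  # 假设的示例应用
spec:
  containers:
  - name: app
    image: demo-app:v1            # 假设的镜像名
    env:
    - name: DB_HOST               # 直接以环境变量注入配置
      value: "mysql.default.svc"
    - name: LOG_LEVEL             # 从 ConfigMap 中读取配置
      valueFrom:
        configMapKeyRef:
          name: demo-config       # 假设已创建的 ConfigMap
          key: log_level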
4. 后端服务
把后端服务当作附加资源
后端服务指程序运行时所需要通过网络调用的各种服务,例如:数据库(MySQL,CouchDB),消息/队列系统(RabbitMQ,Beanstalkd),SMTP 邮件发送服务(Postfix),以及缓存系统(Memcached)。
其中可以根据管理对象分为本地服务(例如本地数据库)和第三方服务(例如Amazon S3)。对于12-Factor应用来说都是附加资源,没有区别对待,当其中一份后端服务失效后,可以切换到备份的后端服务,而不需要修改代码(但可能需要修改配置)。12-Factor应用与后端服务保持松耦合的关系。
5. 构建,发布,运行
严格分离构建和运行
基准代码转化成一份部署需要经过三个阶段:
- 构建阶段:指代码转化为可执行包的过程。构建过程会使用指定版本的代码,获取依赖项,编译生成二进制文件和资源文件。
- 发布阶段:将构建的结果与当前部署所需的配置结合,并可以在运行环境中使用。
- 运行阶段(运行时):指针对指定的发布版本在执行环境中启动一系列应用程序的进程。
12-Factor应用严格区分构建、发布、运行三个步骤,每一个发布版本对应一个唯一的发布ID,可以使用时间戳或递增的版本序列号。
如果需要修改则需要产生一个新的发布版本,如果需要回退,则回退到之前指定的发布版本。
新代码部署之前,由开发人员触发构建操作,构建阶段可以相对复杂一些,方便错误信息可以展示出来得到妥善处理。运行阶段可以人为触发或自动运行,运行阶段应该保持尽可能少的模块。
6. 进程
以一个或多个无状态进程运行应用
12-Factor应用的进程必须是无状态且无共享的,任何需要持久化的数据存储在后端服务中,例如数据库。
内存区域和磁盘空间可以作为进程的缓存,12-Factor应用不需要关注这些缓存的持久化,而是允许其丢失,例如重启的时候。
进程的二进制文件应该在构建阶段执行编译而不是运行阶段。
当应用使用到粘性Session,即将用户的session数据缓存到进程的内存中,将同一用户的后续请求路由到同一个进程。12-Factor应用反对这种处理方式,而是建议将session的数据保存在redis/memcached带有过期时间的缓存中。
7. 端口绑定
通过端口绑定提供服务
应用通过端口绑定来提供服务,并监听发送至该端口的请求。端口绑定的方式意味着一个应用也可以成为另一个应用的后端服务,例如提供某些API请求。
8. 并发
通过进程模型进行扩展
12-Factor应用中,开发人员可以将不同的工作分配给不同类型进程,例如HTTP请求由web进程来处理,常驻的后台工作由worker进程来处理(k8s的设计中就经常用不同类型的manager来处理不同的任务)。
12-Factor应用的进程具备无共享、水平分区的特性,使得水平扩展较为容易。
12-Factor应用的进程不需要守护进程或是写入PID文件,而是通过进程管理器(例如 systemd)来管理输出流,响应崩溃的进程,以及处理用户触发的重启或关闭超级进程的操作。
9. 易处理
快速启动和优雅终止可最大化健壮性
12-Factor应用的进程是易处理的,即它们可以快速的开启或停止,这样有利于快速部署迭代和弹性伸缩实例。
进程应该追求最小的启动时间,这样可以敏捷发布,增加健壮性,当出现问题可以快速在别的机器部署一个实例。
进程一旦接收到终止信号(SIGTERM)就会优雅终止。优雅终止指停止监听服务的端口,拒绝所有新的请求,并继续执行当前已接收的请求,然后退出。
进程还需在面对突然挂掉的情况下保持健壮性,例如通过任务队列的方式来解决进程突然挂掉而没有完成处理的事情,所以应该设计为任务执行是幂等的,可以被重复执行,重复执行的结果是一致的。
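下面是一个捕获SIGTERM并优雅退出的容器入口脚本示意(仅演示思路,实际收尾逻辑应在应用代码中实现):
#!/bin/bash
# 收到 SIGTERM 后先做收尾(停止监听、处理完存量请求),再退出
cleanup() {
  echo "received SIGTERM, draining in-flight requests..."
  exit 0
}
trap cleanup SIGTERM
# 用循环模拟常驻服务进程
while true; do sleep 1; done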
10. 开发环境与线上环境等价
尽可能的保持开发,预发布,线上环境相同
不同的发布环境可能存在以下差异:
- 时间差异:开发到部署的周期较长。
- 人员差异:开发人员只负责开发,运维人员只负责部署。分工过于隔离。
- 工具差异:不同环境的配置和运行环境,使用的后端类型可能存在不同。
应尽量缩小本地与线上的差异,缩短上线周期,开发运维一体化,保证开发环境与线上运行的环境一致(例如,可以通过Docker容器的方式)。
11. 日志
把日志当作事件流
日志应该是事件流的汇总。12-Factor应用本身不考虑存储自己的日志输出流,不去写或管理日志文件,而是通过标准输出(stdout)的方式。
日志的标准输出流可以通过其他组件截获,整合其他的日志输出流,一并发给统一的日志中心处理,用于查看或存档。例如:日志收集开源工具Fluentd。
截获的日志流可以输出至文件,或者在终端实时查看。最重要的是可以发送到Splunk这样的日志索引及分析系统,提供后续的分析统计及监控告警等功能。例如:
- 找出过去一段时间的特殊事件。
- 图形化一个大规模的趋势,如每分钟的请求量。
- 根据用户定义的条件触发告警,如每分钟报错数超过某个警戒线。
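在k8s中,应用把日志写到stdout后即可由平台侧统一收集;临时排查时也可以直接查看,例如:
# 查看 Pod 的标准输出日志
kubectl logs <pod-name> -n <namespace>
# 实时跟踪最近 10 分钟的日志
kubectl logs -f <pod-name> -n <namespace> --since=10m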
12. 管理进程
后台管理任务当作一次性进程运行
开发人员经常需要执行一些管理或维护应用的一次性任务,一次性管理进程应该和常驻进程使用相同的运行环境,开发人员可以通过ssh方式来执行一次性脚本或任务。
参考:
2 - 安装与配置
2.1 - 部署k8s集群
2.1.1 - 使用kubeadm安装生产环境kubernetes
本文为基于
kubeadm搭建生产环境级别高可用的k8s集群。
1. 环境准备
1.0. master硬件配置
参考:
Kubernetes集群Master节点上运行着etcd、kube-apiserver、kube-controller-manager等核心组件,对Kubernetes集群的稳定性有着至关重要的影响。对于生产环境的集群,必须慎重选择Master规格。Master规格跟集群规模有关,集群规模越大,所需要的Master规格也越高。
说明 :可从多个角度衡量集群规模,例如节点数量、Pod数量、部署频率、访问量。这里简单的认为集群规模就是集群里的节点数量。
对于常见的集群规模,可以参见如下的方式选择Master节点的规格(对于测试环境,规格可以小一些。下面的选择能尽量保证Master负载维持在一个较低的水平上)。
| 节点规模 | Master规格 | 磁盘 |
|---|---|---|
| 1~5个节点 | 4核8 GB(不建议2核4 GB) | |
| 6~20个节点 | 4核16 GB | |
| 21~100个节点 | 8核32 GB | |
| 100~200个节点 | 16核64 GB | |
| 1000个节点 | 32核128GB | 1T SSD |
注意事项:
- 由于Etcd的性能瓶颈,Etcd的数据存储盘尽量选择SSD磁盘。
- 为了实现多机房容灾,可将三台master分布在一个可用区下三个不同机房。(机房之间的网络延迟在10毫秒及以下级别)
- 申请LB来做master节点的负载均衡实现高可用,LB作为apiserver的访问地址。
1.1. 设置防火墙端口策略
生产环境设置k8s节点的iptables端口访问规则。
1.1.1. master节点端口配置
| 协议 | 方向 | 端口范围 | 目的 | 使用者 |
|---|---|---|---|---|
| TCP | 入站 | 6443 | Kubernetes API server | 所有 |
| TCP | 入站 | 2379-2380 | etcd server client API | kube-apiserver, etcd |
| TCP | 入站 | 10250 | Kubelet API | 自身, 控制面 |
| TCP | 入站 | 10259 | kube-scheduler | 自身 |
| TCP | 入站 | 10257 | kube-controller-manager | 自身 |
1.1.2. worker节点端口配置
| 协议 | 方向 | 端口范围 | 目的 | 使用者 |
|---|---|---|---|---|
| TCP | 入站 | 10250 | Kubelet API | 自身, 控制面 |
| TCP | 入站 | 30000-32767 | NodePort Services | 所有 |
添加防火墙iptables规则
master节点开放6443、2379、2380端口。
iptables -A INPUT -p tcp -m multiport --dports 6443,2379,2380,10250 -j ACCEPT
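worker节点可参考上表放行对应端口(示意规则,实际端口策略以环境为准):
# worker节点开放Kubelet API及NodePort端口范围
iptables -A INPUT -p tcp --dport 10250 -j ACCEPT
iptables -A INPUT -p tcp --dport 30000:32767 -j ACCEPT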
1.2. 关闭swap分区
[root@master ~]#swapoff -a
[root@master ~]#
[root@master ~]# free -m
total used free shared buff/cache available
Mem: 976 366 135 6 474 393
Swap: 0 0 0
# swap 一栏为0,表示已经关闭了swap
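swapoff -a 只在当前运行期间生效,若希望重启后仍保持关闭,可注释 /etc/fstab 中的swap条目(示意命令):
# 注释掉 /etc/fstab 中的 swap 挂载项,防止重启后重新启用
sed -ri 's/.*swap.*/#&/' /etc/fstab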
1.3. 开启br_netfilter和bridge-nf-call-iptables
参考:https://imroc.cc/post/202105/why-enable-bridge-nf-call-iptables/
# 设置加载br_netfilter模块
cat <<EOF | sudo tee /etc/modules-load.d/k8s.conf
overlay
br_netfilter
EOF
sudo modprobe overlay
sudo modprobe br_netfilter
# 开启bridge-nf-call-iptables ,设置所需的 sysctl 参数,参数在重新启动后保持不变
cat <<EOF | sudo tee /etc/sysctl.d/k8s.conf
net.bridge.bridge-nf-call-iptables = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.ipv4.ip_forward = 1
EOF
# 应用 sysctl 参数而不重新启动
sudo sysctl --system
2. 安装容器运行时
在所有主机上安装容器运行时,推荐使用containerd作为runtime。以下分别是containerd与docker的安装命令。
2.1. Containerd
1、参考:安装containerd
# for ubuntu
apt install -y containerd.io
2、生成默认配置
containerd config default > /etc/containerd/config.toml
3、修改CgroupDriver为systemd
k8s官方推荐使用systemd类型的CgroupDriver。
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
...
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
SystemdCgroup = true
4、重启containerd
systemctl restart containerd
2.2. Docker
# for ubuntu
apt install -y docker.io
官方建议配置cgroupdriver为systemd。
# 修改docker进程管理器
vi /etc/docker/daemon.json
{
"exec-opts": ["native.cgroupdriver=systemd"]
}
systemctl daemon-reload && systemctl restart docker
docker info | grep -i cgroup
2.3. Container Socket
| 运行时 | Unix 域套接字 |
|---|---|
| Containerd | unix:///var/run/containerd/containerd.sock |
| CRI-O | unix:///var/run/crio/crio.sock |
| Docker Engine (使用 cri-dockerd) | unix:///var/run/cri-dockerd.sock |
3. 安装kubeadm,kubelet,kubectl
在所有主机上安装kubeadm、kubelet、kubectl,其版本最好与需要安装的k8s版本一致。
# 以Ubuntu系统为例
# 安装仓库依赖
sudo apt-get update
sudo apt-get install -y apt-transport-https ca-certificates curl
# use google registry
sudo curl -fsSLo /usr/share/keyrings/kubernetes-archive-keyring.gpg https://packages.cloud.google.com/apt/doc/apt-key.gpg
echo "deb [signed-by=/usr/share/keyrings/kubernetes-archive-keyring.gpg] https://apt.kubernetes.io/ kubernetes-xenial main" | sudo tee /etc/apt/sources.list.d/kubernetes.list
# or use aliyun registry
curl -s https://mirrors.aliyun.com/kubernetes/apt/doc/apt-key.gpg | sudo apt-key add -
tee /etc/apt/sources.list.d/kubernetes.list <<EOF
deb https://mirrors.aliyun.com/kubernetes/apt/ kubernetes-xenial main
EOF
# 安装指定版本的kubeadm, kubelet, kubectl
apt-get update
apt-get install -y kubelet=1.24.2-00 kubeadm=1.24.2-00 kubectl=1.24.2-00
# 查询有哪些版本
apt-cache madison kubeadm
离线下载安装
#!/bin/bash
Version=${Version:-1.24.2}
wget https://dl.k8s.io/release/v${Version}/bin/linux/amd64/kubeadm
wget https://dl.k8s.io/release/v${Version}/bin/linux/amd64/kubelet
wget https://dl.k8s.io/release/v${Version}/bin/linux/amd64/kubectl
chmod +x kubeadm kubelet kubectl
cp kubeadm kubelet kubectl /usr/bin/
# add kubelet service
cat > /lib/systemd/system/kubelet.service << EOF
[Unit]
Description=kubelet: The Kubernetes Node Agent
Documentation=https://kubernetes.io/docs/home/
Wants=network-online.target
After=network-online.target
[Service]
ExecStart=/usr/bin/kubelet
Restart=always
StartLimitInterval=0
RestartSec=10
[Install]
WantedBy=multi-user.target
EOF
mkdir -p /etc/systemd/system/kubelet.service.d
cat > /etc/systemd/system/kubelet.service.d/10-kubeadm.conf << 'EOF'
# Note: This dropin only works with kubeadm and kubelet v1.11+
[Service]
Environment="KUBELET_KUBECONFIG_ARGS=--bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf"
Environment="KUBELET_CONFIG_ARGS=--config=/var/lib/kubelet/config.yaml"
# This is a file that "kubeadm init" and "kubeadm join" generates at runtime, populating the KUBELET_KUBEADM_ARGS variable dynamically
EnvironmentFile=-/var/lib/kubelet/kubeadm-flags.env
# This is a file that the user can use for overrides of the kubelet args as a last resort. Preferably, the user should use
# the .NodeRegistration.KubeletExtraArgs object in the configuration files instead. KUBELET_EXTRA_ARGS should be sourced from this file.
EnvironmentFile=-/etc/default/kubelet
ExecStart=
ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_CONFIG_ARGS $KUBELET_KUBEADM_ARGS $KUBELET_EXTRA_ARGS
EOF
systemctl daemon-reload
systemctl restart kubelet
4. 配置kubeadm config
参考:
4.1. 配置项说明
4.1.1. 配置类型
kubeadm config支持以下几类配置。
apiVersion: kubeadm.k8s.io/v1beta3
kind: InitConfiguration
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
apiVersion: kubeadm.k8s.io/v1beta3
kind: JoinConfiguration
可以使用以下命令打印init和join的默认配置。
kubeadm config print init-defaults
kubeadm config print join-defaults
4.1.2. Init配置
kubeadm init配置中只有InitConfiguration 和 ClusterConfiguration 是必须的。
InitConfiguration:
apiVersion: kubeadm.k8s.io/v1beta3
kind: InitConfiguration
bootstrapTokens:
...
nodeRegistration:
...
- bootstrapTokens
- nodeRegistration
  - criSocket:runtime的socket
  - name:节点名称
- localAPIEndpoint
  - advertiseAddress:apiserver的广播IP
  - bindPort:k8s控制面安全端口
ClusterConfiguration:
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
networking:
...
etcd:
...
apiServer:
extraArgs:
...
extraVolumes:
...
...
- networking:
  - podSubnet:Pod CIDR范围
  - serviceSubnet:service CIDR范围
  - dnsDomain
- etcd:
  - dataDir:Etcd的数据存储目录
- apiServer:
  - certSANs:设置额外的apiserver的域名签名证书
- imageRepository:镜像仓库
- controlPlaneEndpoint:控制面LB的域名
- kubernetesVersion:k8s版本
4.2. Init配置示例
在master节点生成默认配置,并修改配置参数。
kubeadm config print init-defaults > kubeadm-config.yaml
修改配置内容
apiVersion: kubeadm.k8s.io/v1beta3
bootstrapTokens:
- groups:
- system:bootstrappers:kubeadm:default-node-token
token: abcdef.0123456789abcdef
ttl: 24h0m0s
usages:
- signing
- authentication
kind: InitConfiguration
localAPIEndpoint:
advertiseAddress: 1.2.3.4 # 修改为apiserver的IP 或者去掉localAPIEndpoint则会读取默认IP。
bindPort: 6443
nodeRegistration:
criSocket: unix:///var/run/containerd/containerd.sock
imagePullPolicy: IfNotPresent
name: node
taints: null
---
apiServer:
certSANs:
- lb.k8s.domain # 添加额外的apiserver的域名
- <vip/lb_ip>
timeoutForControlPlane: 4m0s
apiVersion: kubeadm.k8s.io/v1beta3
certificatesDir: /etc/kubernetes/pki
clusterName: kubernetes
controllerManager: {}
dns: {} # 默认为coredns
etcd:
local:
dataDir: /data/etcd # 修改etcd的存储盘目录
imageRepository: k8s.gcr.io # 修改镜像仓库地址
controlPlaneEndpoint: lb.k8s.domain # 修改控制面域名
kind: ClusterConfiguration
kubernetesVersion: 1.24.0 # k8s 版本
networking:
dnsDomain: cluster.local
serviceSubnet: 10.96.0.0/12
podSubnet: 10.244.0.0/16 # 设置pod的IP范围
scheduler: {}
---
kind: KubeletConfiguration
apiVersion: kubelet.config.k8s.io/v1beta1
cgroupDriver: systemd # 设置为systemd
安装完成后可以查看kubeadm config
kubectl get cm -n kube-system kubeadm-config -oyaml
5. 安装Master控制面
提前拉取镜像:
kubeadm config images pull
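如果使用了自定义的kubeadm-config.yaml(例如修改了imageRepository),可以在拉取镜像时指定该配置文件:
kubeadm config images pull --config kubeadm-config.yaml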
5.1. 安装master
sudo kubeadm init --config kubeadm-config.yaml --upload-certs --node-name <nodename>
部署参数说明:
- --control-plane-endpoint:指定控制面(kube-apiserver)的IP或DNS域名地址。
- --apiserver-advertise-address:kube-apiserver的IP地址。
- --pod-network-cidr:pod network范围,控制面会自动给每个节点分配CIDR。
- --service-cidr:service的IP范围,default "10.96.0.0/12"。
- --kubernetes-version:指定k8s的版本。
- --image-repository:指定k8s镜像仓库地址。
- --upload-certs:将需要在所有控制平面实例之间共享的证书上传到集群。
- --node-name:hostname-override,作为节点名称。
执行完毕会输出添加master和添加worker的命令如下:
...
You can now join any number of control-plane node by running the following command on each as a root:
kubeadm join 192.168.0.200:6443 --token 9vr73a.a8uxyaju799qwdjv --discovery-token-ca-cert-hash sha256:7c2e69131a36ae2a042a339b33381c6d0d43887e2de83720eff5359e26aec866 --control-plane --certificate-key f8902e114ef118304e561c3ecd4d0b543adc226b7a07f675f56564185ffe0c07
Please note that the certificate-key gives access to cluster sensitive data, keep it secret!
As a safeguard, uploaded-certs will be deleted in two hours; If necessary, you can use kubeadm init phase upload-certs to reload certs afterward.
Then you can join any number of worker nodes by running the following on each as root:
kubeadm join 192.168.0.200:6443 --token 9vr73a.a8uxyaju799qwdjv --discovery-token-ca-cert-hash sha256:7c2e69131a36ae2a042a339b33381c6d0d43887e2de83720eff5359e26aec866
5.2. 添加其他master
添加master和添加worker的差别在于添加master多了--control-plane 参数来表示添加类型为master。
kubeadm join <control-plane-endpoint>:6443 --token <token> \
--discovery-token-ca-cert-hash sha256:<hash> \
--control-plane --certificate-key <certificate-key> \
--node-name <nodename>
6. 添加Node节点
kubeadm join <control-plane-endpoint>:6443 --token <token> \
--discovery-token-ca-cert-hash sha256:<hash> \
--cri-socket /run/containerd/containerd.sock \
--node-name <nodename>
7. 安装网络插件
## 如果安装之后node的状态都改为ready,即为成功
wget https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml
kubectl apply -f ./kube-flannel.yml
kubectl get nodes
如果Pod CIDR的网段不是10.244.0.0/16,则需要将flannel配置中的网段更改为与Pod CIDR的网段一致。
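对应的网段配置位于kube-flannel.yml中ConfigMap的net-conf.json字段,大致如下(以常见默认内容为例,实际以所下载的版本为准):
# kube-flannel-cfg ConfigMap 中的 net-conf.json(节选)
net-conf.json: |
  {
    "Network": "10.244.0.0/16",
    "Backend": {
      "Type": "vxlan"
    }
  }
# 将 "Network" 修改为与 kubeadm 配置中 podSubnet 一致的网段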
7.1. 问题
Warning FailedCreatePodSandBox 4m6s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "300d9b570cc1e23b6335c407b8e7d0ef2c74dc2fe5d7a110678c2dc919c62edf": plugin type="flannel" failed (add): failed to delegate add: failed to set bridge addr: "cni0" already has an IP address different from 10.244.3.1/24
原因:
宿主机节点有cni0网卡,且网卡的IP段与flannel的CIDR网段不同,因此需要删除该网卡,让其重建。
解决:
ifconfig cni0 down
ip link delete cni0
8. 部署dashboard
kubectl apply -f https://raw.githubusercontent.com/kubernetes/dashboard/v2.5.0/aio/deploy/recommended.yaml
镜像: kubernetesui/dashboard:v2.5.0
默认端口:8443
9. 重置部署
# kubeadm重置
kubeadm reset
# 清空数据目录
rm -fr /data/etcd
rm -fr /etc/kubernetes
rm -fr ~/.kube/
删除flannel
ifconfig cni0 down
ip link delete cni0
ifconfig flannel.1 down
ip link delete flannel.1
rm -rf /var/lib/cni/
rm -f /etc/cni/net.d/*
10. 问题排查
10.1. kubeadm token过期
问题描述:
添加节点时报以下错误:
[discovery] The cluster-info ConfigMap does not yet contain a JWS signature for token ID "abcdef", will try again
原因:token过期,初始化生成的token在24小时后会被master删除。
解决办法:
# 重新生成token
kubeadm token create --print-join-command
kubeadm token list
# kubeadm token create
oumnnc.aqlxuvdbntlvzoiv
# 重新生成hash
openssl x509 -pubkey -in /etc/kubernetes/pki/ca.crt | openssl rsa -pubin -outform der 2>/dev/null | openssl dgst -sha256 -hex | sed 's/^.* //'
基于新生成的token重新添加节点。
10.2. 修改kubeadm join的master IP或端口
kubeadm join命令会去kube-public命名空间获取名为cluster-info的ConfigMap。如果需要修改kubeadm join使用的master的IP或端口,则需要修改cluster-info的configmap。
# 查看cluster-info
kubectl -n kube-public get configmaps cluster-info -o yaml
# 修改cluster-info
kubectl -n kube-public edit configmaps cluster-info
修改配置文件中的server字段
clusters:
- cluster:
certificate-authority-data: xxx
server: https://lb.k8s.domain:36443
name: ""
执行kubeadm join的命令时指定新修改的master地址。
参考:
- 利用 kubeadm 创建高可用集群 | Kubernetes
- 使用 kubeadm 创建集群 | Kubernetes
- 高可用拓扑选项 | Kubernetes
- kubeadm init | Kubernetes
- v1.24.2|kubeadm|v1beta3
- Installing kubeadm | Kubernetes
- Ports and Protocols | Kubernetes
- 容器运行时 | Kubernetes
- https://github.com/Mirantis/cri-dockerd
- 配置 cgroup 驱动|Kubernetes
- GitHub: flannel is a network fabric for containers
- 部署和访问 Kubernetes 仪表板(Dashboard) | Kubernetes
2.1.2 - 使用kubespray安装kubernetes
1. 环境准备
1.1. 部署机器
以下机器为虚拟机
| 机器IP | 主机名 | 角色 | 系统版本 | 备注 |
|---|---|---|---|---|
| 172.16.94.140 | kube-master-0 | k8s master | Centos 4.17.14 | 内存:3G |
| 172.16.94.141 | kube-node-41 | k8s node | Centos 4.17.14 | 内存:3G |
| 172.16.94.142 | kube-node-42 | k8s node | Centos 4.17.14 | 内存:3G |
| 172.16.94.135 | - | 部署管理机 | - | - |
1.2. 配置管理机
管理机主要用来部署k8s集群,需要安装以下版本的软件,具体可参考:
- https://github.com/kubernetes-incubator/kubespray#requirements
- https://github.com/kubernetes-incubator/kubespray/blob/master/requirements.txt
ansible>=2.4.0
jinja2>=2.9.6
netaddr
pbr>=1.6
ansible-modules-hashivault>=3.9.4
hvac
1、安装及配置ansible
- 参考ansible的使用。
- 给部署机器配置SSH的免密登录权限,具体参考ssh免密登录,示意命令见本节末尾。
2、安装python-netaddr
# 安装pip
yum -y install epel-release
yum -y install python-pip
# 安装python-netaddr
pip install netaddr
3、升级Jinja
# Jinja 2.9 (or newer)
pip install --upgrade jinja2
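上文第1步中提到的SSH免密登录,可在管理机上参考以下示意命令配置(IP以上文部署机器表为例):
# 在管理机生成SSH密钥对
ssh-keygen -t rsa
# 将公钥分发到各部署机器,实现免密登录
ssh-copy-id root@172.16.94.140
ssh-copy-id root@172.16.94.141
ssh-copy-id root@172.16.94.142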
1.3. 配置部署机器
部署机器即用来运行k8s集群的机器,包括Master和Node。
1、确认系统版本
本文采用centos7的系统,建议将系统内核升级到4.x.x以上。
2、关闭防火墙
systemctl stop firewalld
systemctl disable firewalld
iptables -F
3、关闭swap
Kubespray v2.5.0版本需要关闭swap,具体参考
- name: Stop if swap enabled
assert:
that: ansible_swaptotal_mb == 0
when: kubelet_fail_swap_on|default(true)
ignore_errors: "{{ ignore_assert_errors }}"
V2.6.0版本去除了swap的检查,具体参考:
执行关闭swap命令swapoff -a。
[root@master ~]#swapoff -a
[root@master ~]#
[root@master ~]# free -m
total used free shared buff/cache available
Mem: 976 366 135 6 474 393
Swap: 0 0 0
# swap 一栏为0,表示已经关闭了swap
4、确认部署机器内存
由于本文采用虚拟机部署,内存可能存在不足的问题,因此将虚拟机内存调整为3G或以上;如果是物理机一般不会有内存不足的问题。具体参考:
- name: Stop if memory is too small for masters
assert:
that: ansible_memtotal_mb >= 1500
ignore_errors: "{{ ignore_assert_errors }}"
when: inventory_hostname in groups['kube-master']
- name: Stop if memory is too small for nodes
assert:
that: ansible_memtotal_mb >= 1024
ignore_errors: "{{ ignore_assert_errors }}"
when: inventory_hostname in groups['kube-node']
1.4. 涉及镜像
Docker版本为17.03.2-ce。
1、Master节点
| 镜像 | 版本 | 大小 | 镜像ID | 备注 |
|---|---|---|---|---|
| gcr.io/google-containers/hyperkube | v1.9.5 | 620 MB | a7e7fdbc5fee | k8s |
| quay.io/coreos/etcd | v3.2.4 | 35.7 MB | 498ffffcfd05 | |
| gcr.io/google_containers/pause-amd64 | 3.0 | 747 kB | 99e59f495ffa | |
| quay.io/calico/node | v2.6.8 | 282 MB | e96a297310fd | calico |
| quay.io/calico/cni | v1.11.4 | 70.8 MB | 4c4cb67d7a88 | calico |
| quay.io/calico/ctl | v1.6.3 | 44.4 MB | 46d3aace8bc6 | calico |
2、Node节点
| 镜像 | 版本 | 大小 | 镜像ID | 备注 |
|---|---|---|---|---|
| gcr.io/google-containers/hyperkube | v1.9.5 | 620 MB | a7e7fdbc5fee | k8s |
| gcr.io/google_containers/pause-amd64 | 3.0 | 747 kB | 99e59f495ffa | |
| quay.io/calico/node | v2.6.8 | 282 MB | e96a297310fd | calico |
| quay.io/calico/cni | v1.11.4 | 70.8 MB | 4c4cb67d7a88 | calico |
| quay.io/calico/ctl | v1.6.3 | 44.4 MB | 46d3aace8bc6 | calico |
| gcr.io/google_containers/k8s-dns-dnsmasq-nanny-amd64 | 1.14.8 | 40.9 MB | c2ce1ffb51ed | dns |
| gcr.io/google_containers/k8s-dns-sidecar-amd64 | 1.14.8 | 42.2 MB | 6f7f2dc7fab5 | dns |
| gcr.io/google_containers/k8s-dns-kube-dns-amd64 | 1.14.8 | 50.5 MB | 80cc5ea4b547 | dns |
| gcr.io/google_containers/cluster-proportional-autoscaler-amd64 | 1.1.2 | 50.5 MB | 78cf3f492e6b | |
| gcr.io/google_containers/kubernetes-dashboard-amd64 | v1.8.3 | 102 MB | 0c60bcf89900 | dashboard |
| nginx | 1.13 | 109 MB | ae513a47849c | - |
3、说明
- 镜像被墙并且全部镜像下载需要较多时间,建议提前下载到部署机器上。
- hyperkube镜像主要用来运行k8s核心组件(例如kube-apiserver等)。
- 此处使用的网络组件为calico。
2. 部署集群
2.1. 下载kubespary的源码
git clone https://github.com/kubernetes-incubator/kubespray.git
2.2. 编辑配置文件
2.2.1. hosts.ini
hosts.ini主要为部署节点机器信息的文件,路径为:kubespray/inventory/sample/hosts.ini。
cd kubespray
# 复制一份配置进行修改
cp -rfp inventory/sample inventory/k8s
vi inventory/k8s/hosts.ini
例如:
hosts.ini文件可以填写部署机器的登录密码,也可以不填密码而设置ssh的免密登录。
# Configure 'ip' variable to bind kubernetes services on a
# different ip than the default iface
# 主机名 ssh登陆IP ssh用户名 ssh登陆密码 机器IP 子网掩码
kube-master-0 ansible_ssh_host=172.16.94.140 ansible_ssh_user=root ansible_ssh_pass=123 ip=172.16.94.140 mask=/24
kube-node-41 ansible_ssh_host=172.16.94.141 ansible_ssh_user=root ansible_ssh_pass=123 ip=172.16.94.141 mask=/24
kube-node-42 ansible_ssh_host=172.16.94.142 ansible_ssh_user=root ansible_ssh_pass=123 ip=172.16.94.142 mask=/24
# configure a bastion host if your nodes are not directly reachable
# bastion ansible_ssh_host=x.x.x.x
[kube-master]
kube-master-0
[etcd]
kube-master-0
[kube-node]
kube-node-41
kube-node-42
[k8s-cluster:children]
kube-node
kube-master
[calico-rr]
2.2.2. k8s-cluster.yml
k8s-cluster.yml主要为k8s集群的配置文件,路径为:kubespray/inventory/k8s/group_vars/k8s-cluster.yml。该文件可以修改安装的k8s集群的版本,参数为:kube_version: v1.9.5。具体可参考:
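该文件中常用的几个配置项大致如下(示例取值,变量名及默认值以实际版本的文件为准):
# inventory/k8s/group_vars/k8s-cluster.yml(节选)
kube_version: v1.9.5                    # 安装的k8s版本
kube_network_plugin: calico             # 网络插件,本文使用calico
kube_service_addresses: 10.233.0.0/18   # Service网段
kube_pods_subnet: 10.233.64.0/18        # Pod网段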
2.3. 执行部署操作
涉及文件为cluster.yml。
# 进入主目录
cd kubespray
# 执行部署命令
ansible-playbook -i inventory/k8s/hosts.ini cluster.yml -b -vvv
-vvv 参数表示输出运行日志
如果需要重置可以执行以下命令:
涉及文件为reset.yml。
ansible-playbook -i inventory/k8s/hosts.ini reset.yml -b -vvv
3. 确认部署结果
3.1. ansible的部署结果
ansible命令执行完,出现以下日志,则说明部署成功,否则根据报错内容进行修改。
PLAY RECAP *****************************************************************************
kube-master-0 : ok=309 changed=30 unreachable=0 failed=0
kube-node-41 : ok=203 changed=8 unreachable=0 failed=0
kube-node-42 : ok=203 changed=8 unreachable=0 failed=0
localhost : ok=2 changed=0 unreachable=0 failed=0
以下为部分部署执行日志:
kubernetes/preinstall : Update package management cache (YUM) --------------------23.96s
/root/gopath/src/kubespray/roles/kubernetes/preinstall/tasks/main.yml:121
kubernetes/master : Master | wait for the apiserver to be running ----------------23.44s
/root/gopath/src/kubespray/roles/kubernetes/master/handlers/main.yml:79
kubernetes/preinstall : Install packages requirements ----------------------------20.20s
/root/gopath/src/kubespray/roles/kubernetes/preinstall/tasks/main.yml:203
kubernetes/secrets : Check certs | check if a cert already exists on node --------13.94s
/root/gopath/src/kubespray/roles/kubernetes/secrets/tasks/check-certs.yml:17
gather facts from all instances --------------------------------------------------9.98s
/root/gopath/src/kubespray/cluster.yml:25
kubernetes/node : install | Compare host kubelet with hyperkube container --------9.66s
/root/gopath/src/kubespray/roles/kubernetes/node/tasks/install_host.yml:2
kubernetes-apps/ansible : Kubernetes Apps | Start Resources -----------------------9.27s
/root/gopath/src/kubespray/roles/kubernetes-apps/ansible/tasks/main.yml:37
kubernetes-apps/ansible : Kubernetes Apps | Lay Down KubeDNS Template ------------8.47s
/root/gopath/src/kubespray/roles/kubernetes-apps/ansible/tasks/kubedns.yml:3
download : Sync container ---------------------------------------------------------8.23s
/root/gopath/src/kubespray/roles/download/tasks/main.yml:15
kubernetes-apps/network_plugin/calico : Start Calico resources --------------------7.82s
/root/gopath/src/kubespray/roles/kubernetes-apps/network_plugin/calico/tasks/main.yml:2
download : Download items ---------------------------------------------------------7.67s
/root/gopath/src/kubespray/roles/download/tasks/main.yml:6
download : Download items ---------------------------------------------------------7.48s
/root/gopath/src/kubespray/roles/download/tasks/main.yml:6
download : Sync container ---------------------------------------------------------7.35s
/root/gopath/src/kubespray/roles/download/tasks/main.yml:15
download : Download items ---------------------------------------------------------7.16s
/root/gopath/src/kubespray/roles/download/tasks/main.yml:6
network_plugin/calico : Calico | Copy cni plugins from calico/cni container -------7.10s
/root/gopath/src/kubespray/roles/network_plugin/calico/tasks/main.yml:62
download : Download items ---------------------------------------------------------7.04s
/root/gopath/src/kubespray/roles/download/tasks/main.yml:6
download : Download items ---------------------------------------------------------7.01s
/root/gopath/src/kubespray/roles/download/tasks/main.yml:6
download : Sync container ---------------------------------------------------------7.00s
/root/gopath/src/kubespray/roles/download/tasks/main.yml:15
download : Download items ---------------------------------------------------------6.98s
/root/gopath/src/kubespray/roles/download/tasks/main.yml:6
download : Download items ---------------------------------------------------------6.79s
/root/gopath/src/kubespray/roles/download/tasks/main.yml:6
3.2. k8s集群运行结果
1、k8s组件信息
# kubectl get all --namespace=kube-system
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
ds/calico-node 3 3 3 3 3 <none> 2h
NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
deploy/kube-dns 2 2 2 2 2h
deploy/kubedns-autoscaler 1 1 1 1 2h
deploy/kubernetes-dashboard 1 1 1 1 2h
NAME DESIRED CURRENT READY AGE
rs/kube-dns-79d99cdcd5 2 2 2 2h
rs/kubedns-autoscaler-5564b5585f 1 1 1 2h
rs/kubernetes-dashboard-69cb58d748 1 1 1 2h
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
ds/calico-node 3 3 3 3 3 <none> 2h
NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
deploy/kube-dns 2 2 2 2 2h
deploy/kubedns-autoscaler 1 1 1 1 2h
deploy/kubernetes-dashboard 1 1 1 1 2h
NAME DESIRED CURRENT READY AGE
rs/kube-dns-79d99cdcd5 2 2 2 2h
rs/kubedns-autoscaler-5564b5585f 1 1 1 2h
rs/kubernetes-dashboard-69cb58d748 1 1 1 2h
NAME READY STATUS RESTARTS AGE
po/calico-node-22vsg 1/1 Running 0 2h
po/calico-node-t7zgw 1/1 Running 0 2h
po/calico-node-zqnx8 1/1 Running 0 2h
po/kube-apiserver-kube-master-0 1/1 Running 0 22h
po/kube-controller-manager-kube-master-0 1/1 Running 0 2h
po/kube-dns-79d99cdcd5-f2t6t 3/3 Running 0 2h
po/kube-dns-79d99cdcd5-gw944 3/3 Running 0 2h
po/kube-proxy-kube-master-0 1/1 Running 2 22h
po/kube-proxy-kube-node-41 1/1 Running 3 22h
po/kube-proxy-kube-node-42 1/1 Running 3 22h
po/kube-scheduler-kube-master-0 1/1 Running 0 2h
po/kubedns-autoscaler-5564b5585f-lt9bb 1/1 Running 0 2h
po/kubernetes-dashboard-69cb58d748-wmb9x 1/1 Running 0 2h
po/nginx-proxy-kube-node-41 1/1 Running 3 22h
po/nginx-proxy-kube-node-42 1/1 Running 3 22h
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
svc/kube-dns ClusterIP 10.233.0.3 <none> 53/UDP,53/TCP 2h
svc/kubernetes-dashboard ClusterIP 10.233.27.24 <none> 443/TCP 2h
2、k8s节点信息
# kubectl get nodes
NAME STATUS ROLES AGE VERSION
kube-master-0 Ready master 22h v1.9.5
kube-node-41 Ready node 22h v1.9.5
kube-node-42 Ready node 22h v1.9.5
3、组件健康信息
# kubectl get cs
NAME STATUS MESSAGE ERROR
scheduler Healthy ok
controller-manager Healthy ok
etcd-0 Healthy {"health": "true"}
4. k8s集群扩容节点
4.1. 修改hosts.ini文件
如果需要扩容Node节点,则修改hosts.ini文件,增加新增的机器信息。例如,要增加节点机器kube-node-43(IP为172.16.94.143),修改后的文件内容如下:
# Configure 'ip' variable to bind kubernetes services on a
# different ip than the default iface
# 主机名 ssh登陆IP ssh用户名 ssh登陆密码 机器IP 子网掩码
kube-master-0 ansible_ssh_host=172.16.94.140 ansible_ssh_user=root ansible_ssh_pass=123 ip=172.16.94.140 mask=/24
kube-node-41 ansible_ssh_host=172.16.94.141 ansible_ssh_user=root ansible_ssh_pass=123 ip=172.16.94.141 mask=/24
kube-node-42 ansible_ssh_host=172.16.94.142 ansible_ssh_user=root ansible_ssh_pass=123 ip=172.16.94.142 mask=/24
kube-node-43 ansible_ssh_host=172.16.94.143 ansible_ssh_user=root ansible_ssh_pass=123 ip=172.16.94.143 mask=/24
# configure a bastion host if your nodes are not directly reachable
# bastion ansible_ssh_host=x.x.x.x
[kube-master]
kube-master-0
[etcd]
kube-master-0
[kube-node]
kube-node-41
kube-node-42
kube-node-43
[k8s-cluster:children]
kube-node
kube-master
[calico-rr]
4.2. 执行扩容命令
涉及文件为scale.yml。
# 进入主目录
cd kubespray
# 执行部署命令
ansible-playbook -i inventory/k8s/hosts.ini scale.yml -b -vvv
4.3. 检查扩容结果
1、ansible的执行结果
PLAY RECAP ***************************************
kube-node-41 : ok=228 changed=11 unreachable=0 failed=0
kube-node-42 : ok=197 changed=6 unreachable=0 failed=0
kube-node-43 : ok=227 changed=69 unreachable=0 failed=0 # 新增Node节点
localhost : ok=2 changed=0 unreachable=0 failed=0
2、k8s的节点信息
# kubectl get nodes
NAME STATUS ROLES AGE VERSION
kube-master-0 Ready master 1d v1.9.5
kube-node-41 Ready node 1d v1.9.5
kube-node-42 Ready node 1d v1.9.5
kube-node-43 Ready node 1m v1.9.5 #该节点为新增Node节点
可以看到新增的kube-node-43节点已经扩容完成。
3、k8s组件信息
# kubectl get po --namespace=kube-system -o wide
NAME READY STATUS RESTARTS AGE IP NODE
calico-node-22vsg 1/1 Running 0 10h 172.16.94.140 kube-master-0
calico-node-8fz9x 1/1 Running 2 27m 172.16.94.143 kube-node-43
calico-node-t7zgw 1/1 Running 0 10h 172.16.94.142 kube-node-42
calico-node-zqnx8 1/1 Running 0 10h 172.16.94.141 kube-node-41
kube-apiserver-kube-master-0 1/1 Running 0 1d 172.16.94.140 kube-master-0
kube-controller-manager-kube-master-0 1/1 Running 0 10h 172.16.94.140 kube-master-0
kube-dns-79d99cdcd5-f2t6t 3/3 Running 0 10h 10.233.100.194 kube-node-41
kube-dns-79d99cdcd5-gw944 3/3 Running 0 10h 10.233.107.1 kube-node-42
kube-proxy-kube-master-0 1/1 Running 2 1d 172.16.94.140 kube-master-0
kube-proxy-kube-node-41 1/1 Running 3 1d 172.16.94.141 kube-node-41
kube-proxy-kube-node-42 1/1 Running 3 1d 172.16.94.142 kube-node-42
kube-proxy-kube-node-43 1/1 Running 0 26m 172.16.94.143 kube-node-43
kube-scheduler-kube-master-0 1/1 Running 0 10h 172.16.94.140 kube-master-0
kubedns-autoscaler-5564b5585f-lt9bb 1/1 Running 0 10h 10.233.100.193 kube-node-41
kubernetes-dashboard-69cb58d748-wmb9x 1/1 Running 0 10h 10.233.107.2 kube-node-42
nginx-proxy-kube-node-41 1/1 Running 3 1d 172.16.94.141 kube-node-41
nginx-proxy-kube-node-42 1/1 Running 3 1d 172.16.94.142 kube-node-42
nginx-proxy-kube-node-43 1/1 Running 0 26m 172.16.94.143 kube-node-43
5. 部署高可用集群
将hosts.ini文件中的master和etcd的机器增加到多台,执行部署命令。
ansible-playbook -i inventory/k8s/hosts.ini cluster.yml -b -vvv
例如:
# Configure 'ip' variable to bind kubernetes services on a
# different ip than the default iface
# 主机名 ssh登陆IP ssh用户名 ssh登陆密码 机器IP 子网掩码
kube-master-0 ansible_ssh_host=172.16.94.140 ansible_ssh_user=root ansible_ssh_pass=123 ip=172.16.94.140 mask=/24
kube-master-1 ansible_ssh_host=172.16.94.144 ansible_ssh_user=root ansible_ssh_pass=123 ip=172.16.94.144 mask=/24
kube-master-2 ansible_ssh_host=172.16.94.145 ansible_ssh_user=root ansible_ssh_pass=123 ip=172.16.94.145 mask=/24
kube-node-41 ansible_ssh_host=172.16.94.141 ansible_ssh_user=root ansible_ssh_pass=123 ip=172.16.94.141 mask=/24
kube-node-42 ansible_ssh_host=172.16.94.142 ansible_ssh_user=root ansible_ssh_pass=123 ip=172.16.94.142 mask=/24
kube-node-43 ansible_ssh_host=172.16.94.143 ansible_ssh_user=root ansible_ssh_pass=123 ip=172.16.94.143 mask=/24
# configure a bastion host if your nodes are not directly reachable
# bastion ansible_ssh_host=x.x.x.x
[kube-master]
kube-master-0
kube-master-1
kube-master-2
[etcd]
kube-master-0
kube-master-1
kube-master-2
[kube-node]
kube-node-41
kube-node-42
kube-node-43
[k8s-cluster:children]
kube-node
kube-master
[calico-rr]
6. 升级k8s集群
选择对应的k8s版本信息,执行升级命令。涉及文件为upgrade-cluster.yml。
ansible-playbook upgrade-cluster.yml -b -i inventory/k8s/hosts.ini -e kube_version=v1.10.4 -vvv
7. Troubleshooting
在使用kubespray部署k8s集群时,主要遇到以下报错。
7.1. python-netaddr未安装
- 报错内容:
fatal: [node1]: FAILED! => {"failed": true, "msg": "The ipaddr filter requires python-netaddr be installed on the ansible controller"}
- 解决方法:
需要安装 python-netaddr,具体参考上述[环境准备]内容。
7.2. swap未关闭
- 报错内容:
fatal: [kube-master-0]: FAILED! => {
"assertion": "ansible_swaptotal_mb == 0",
"changed": false,
"evaluated_to": false
}
fatal: [kube-node-41]: FAILED! => {
"assertion": "ansible_swaptotal_mb == 0",
"changed": false,
"evaluated_to": false
}
fatal: [kube-node-42]: FAILED! => {
"assertion": "ansible_swaptotal_mb == 0",
"changed": false,
"evaluated_to": false
}
- 解决方法:
所有部署机器执行swapoff -a关闭swap,具体参考上述[环境准备]内容。
7.3. 部署机器内存过小
- 报错内容:
TASK [kubernetes/preinstall : Stop if memory is too small for masters] *********************************************************************************************************************************************************************************************************
task path: /root/gopath/src/kubespray/roles/kubernetes/preinstall/tasks/verify-settings.yml:52
Friday 10 August 2018 21:50:26 +0800 (0:00:00.940) 0:01:14.088 *********
fatal: [kube-master-0]: FAILED! => {
"assertion": "ansible_memtotal_mb >= 1500",
"changed": false,
"evaluated_to": false
}
TASK [kubernetes/preinstall : Stop if memory is too small for nodes] ***********************************************************************************************************************************************************************************************************
task path: /root/gopath/src/kubespray/roles/kubernetes/preinstall/tasks/verify-settings.yml:58
Friday 10 August 2018 21:50:27 +0800 (0:00:00.570) 0:01:14.659 *********
fatal: [kube-node-41]: FAILED! => {
"assertion": "ansible_memtotal_mb >= 1024",
"changed": false,
"evaluated_to": false
}
fatal: [kube-node-42]: FAILED! => {
"assertion": "ansible_memtotal_mb >= 1024",
"changed": false,
"evaluated_to": false
}
to retry, use: --limit @/root/gopath/src/kubespray/cluster.retry
- 解决方法:
调大所有部署机器的内存,本示例中调整为3G或以上。
7.4. kube-scheduler组件运行失败
kube-scheduler组件运行失败,导致http://localhost:10251/healthz调用失败。
- 报错内容:
FAILED - RETRYING: Master | wait for kube-scheduler (1 retries left).
FAILED - RETRYING: Master | wait for kube-scheduler (1 retries left).
fatal: [node1]: FAILED! => {"attempts": 60, "changed": false, "content": "", "failed": true, "msg": "Status code was not [200]: Request failed: <urlopen error [Errno 111] Connection refused>", "redirected": false, "status": -1, "url": "http://localhost:10251/healthz"}
- 解决方法:
可能是内存不足导致,本示例中调大了部署机器的内存。
7.5. docker安装包冲突
- 报错内容:
failed: [k8s-node-1] (item={u'name': u'docker-engine-1.13.1-1.el7.centos'}) => {
"attempts": 4,
"changed": false,
...
"item": {
"name": "docker-engine-1.13.1-1.el7.centos"
},
"msg": "Error: docker-ce-selinux conflicts with 2:container-selinux-2.66-1.el7.noarch\n",
"rc": 1,
"results": [
"Loaded plugins: fastestmirror\nLoading mirror speeds from cached hostfile\n * elrepo: mirrors.tuna.tsinghua.edu.cn\n * epel: mirrors.tongji.edu.cn\nPackage docker-engine is obsoleted by docker-ce, trying to install docker-ce-17.03.2.ce-1.el7.centos.x86_64 instead\nResolving Dependencies\n--> Running transaction check\n---> Package docker-ce.x86_64 0:17.03.2.ce-1.el7.centos will be installed\n--> Processing Dependency: docker-ce-selinux >= 17.03.2.ce-1.el7.centos for package: docker-ce-17.03.2.ce-1.el7.centos.x86_64\n--> Processing Dependency: libltdl.so.7()(64bit) for package: docker-ce-17.03.2.ce-1.el7.centos.x86_64\n--> Running transaction check\n---> Package docker-ce-selinux.noarch 0:17.03.2.ce-1.el7.centos will be installed\n---> Package libtool-ltdl.x86_64 0:2.4.2-22.el7_3 will be installed\n--> Processing Conflict: docker-ce-selinux-17.03.2.ce-1.el7.centos.noarch conflicts docker-selinux\n--> Restarting Dependency Resolution with new changes.\n--> Running transaction check\n---> Package container-selinux.noarch 2:2.55-1.el7 will be updated\n---> Package container-selinux.noarch 2:2.66-1.el7 will be an update\n--> Processing Conflict: docker-ce-selinux-17.03.2.ce-1.el7.centos.noarch conflicts docker-selinux\n--> Finished Dependency Resolution\n You could try using --skip-broken to work around the problem\n You could try running: rpm -Va --nofiles --nodigest\n"
]
}
- 解决方法:
卸载旧的docker版本,由kubespray自动安装。
sudo yum remove -y docker \
docker-client \
docker-client-latest \
docker-common \
docker-latest \
docker-latest-logrotate \
docker-logrotate \
docker-selinux \
docker-engine-selinux \
docker-engine
参考文章:
2.1.3 - 使用minikube安装kubernetes
以下内容基于Linux系统(以Ubuntu为例)。
1. 安装kubectl
curl -LO https://storage.googleapis.com/kubernetes-release/release/$(curl -s https://storage.googleapis.com/kubernetes-release/release/stable.txt)/bin/linux/amd64/kubectl && chmod +x kubectl && sudo mv kubectl /usr/local/bin/
下载指定版本,例如下载v1.9.0版本
curl -LO https://storage.googleapis.com/kubernetes-release/release/v1.9.0/bin/linux/amd64/kubectl && chmod +x kubectl && sudo mv kubectl /usr/local/bin/
2. 安装minikube
minikube的源码地址:https://github.com/kubernetes/minikube
2.1 安装minikube
以下命令为安装latest版本的minikube。
curl -Lo minikube https://storage.googleapis.com/minikube/releases/latest/minikube-linux-amd64 && chmod +x minikube && sudo mv minikube /usr/local/bin/
安装指定版本可到https://github.com/kubernetes/minikube/releases下载对应版本。
例如:以下为安装v0.28.2版本
curl -Lo minikube https://storage.googleapis.com/minikube/releases/v0.28.2/minikube-linux-amd64 && chmod +x minikube && sudo mv minikube /usr/local/bin/
2.2 minikube命令帮助
Minikube is a CLI tool that provisions and manages single-node Kubernetes clusters optimized for development workflows.
Usage:
minikube [command]
Available Commands:
addons Modify minikube's kubernetes addons
cache Add or delete an image from the local cache.
completion Outputs minikube shell completion for the given shell (bash or zsh)
config Modify minikube config
dashboard Opens/displays the kubernetes dashboard URL for your local cluster
delete Deletes a local kubernetes cluster
docker-env Sets up docker env variables; similar to '$(docker-machine env)'
get-k8s-versions Gets the list of Kubernetes versions available for minikube when using the localkube bootstrapper
ip Retrieves the IP address of the running cluster
logs Gets the logs of the running localkube instance, used for debugging minikube, not user code
mount Mounts the specified directory into minikube
profile Profile sets the current minikube profile
service Gets the kubernetes URL(s) for the specified service in your local cluster
ssh Log into or run a command on a machine with SSH; similar to 'docker-machine ssh'
ssh-key Retrieve the ssh identity key path of the specified cluster
start Starts a local kubernetes cluster
status Gets the status of a local kubernetes cluster
stop Stops a running local kubernetes cluster
update-check Print current and latest version number
update-context Verify the IP address of the running cluster in kubeconfig.
version Print the version of minikube
Flags:
--alsologtostderr log to standard error as well as files
-b, --bootstrapper string The name of the cluster bootstrapper that will set up the kubernetes cluster. (default "localkube")
--log_backtrace_at traceLocation when logging hits line file:N, emit a stack trace (default :0)
--log_dir string If non-empty, write log files in this directory
--loglevel int Log level (0 = DEBUG, 5 = FATAL) (default 1)
--logtostderr log to standard error instead of files
-p, --profile string The name of the minikube VM being used.
This can be modified to allow for multiple minikube instances to be run independently (default "minikube")
--stderrthreshold severity logs at or above this threshold go to stderr (default 2)
-v, --v Level log level for V logs
--vmodule moduleSpec comma-separated list of pattern=N settings for file-filtered logging
Use "minikube [command] --help" for more information about a command.
3. 使用minikube安装k8s集群
3.1. minikube start
可以直接以Docker容器的方式运行k8s组件,但需要先安装Docker(可参考Docker安装),启动时使用--vm-driver=none参数。
minikube start --vm-driver=none
例如:
root@ubuntu:~# minikube start --vm-driver=none
Starting local Kubernetes v1.10.0 cluster...
Starting VM...
Getting VM IP address...
Moving files into cluster...
Downloading kubeadm v1.10.0
Downloading kubelet v1.10.0
^[[DFinished Downloading kubelet v1.10.0
Finished Downloading kubeadm v1.10.0
Setting up certs...
Connecting to cluster...
Setting up kubeconfig...
Starting cluster components...
Kubectl is now configured to use the cluster.
===================
WARNING: IT IS RECOMMENDED NOT TO RUN THE NONE DRIVER ON PERSONAL WORKSTATIONS
The 'none' driver will run an insecure kubernetes apiserver as root that may leave the host vulnerable to CSRF attacks
When using the none driver, the kubectl config and credentials generated will be root owned and will appear in the root home directory.
You will need to move the files to the appropriate location and then set the correct permissions. An example of this is below:
sudo mv /root/.kube $HOME/.kube # this will write over any previous configuration
sudo chown -R $USER $HOME/.kube
sudo chgrp -R $USER $HOME/.kube
sudo mv /root/.minikube $HOME/.minikube # this will write over any previous configuration
sudo chown -R $USER $HOME/.minikube
sudo chgrp -R $USER $HOME/.minikube
This can also be done automatically by setting the env var CHANGE_MINIKUBE_NONE_USER=true
Loading cached images from config file.
安装指定版本的kubernetes集群
# 查阅版本
minikube get-k8s-versions
# 选择版本启动
minikube start --kubernetes-version v1.7.3 --vm-driver=none
3.2. minikube status
$ minikube status
minikube: Running
cluster: Running
kubectl: Correctly Configured: pointing to minikube-vm at 172.16.94.139
3.3. minikube stop
minikube stop 命令可以用来停止集群。 该命令会关闭 minikube 虚拟机,但将保留所有集群状态和数据。 再次启动集群将恢复到之前的状态。
3.4. minikube delete
minikube delete 命令可以用来删除集群。 该命令将关闭并删除 minikube 虚拟机。没有数据或状态会被保存下来。
4. 查看部署结果
4.1. 部署组件
root@ubuntu:~# kubectl get all --namespace=kube-system
NAME READY STATUS RESTARTS AGE
pod/etcd-minikube 1/1 Running 0 38m
pod/kube-addon-manager-minikube 1/1 Running 0 38m
pod/kube-apiserver-minikube 1/1 Running 1 39m
pod/kube-controller-manager-minikube 1/1 Running 0 38m
pod/kube-dns-86f4d74b45-bdfnx 3/3 Running 0 38m
pod/kube-proxy-dqdvg 1/1 Running 0 38m
pod/kube-scheduler-minikube 1/1 Running 0 38m
pod/kubernetes-dashboard-5498ccf677-c2gnh 1/1 Running 0 38m
pod/storage-provisioner 1/1 Running 0 38m
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/kube-dns ClusterIP 10.96.0.10 <none> 53/UDP,53/TCP 38m
service/kubernetes-dashboard NodePort 10.104.48.227 <none> 80:30000/TCP 38m
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
daemonset.apps/kube-proxy 1 1 1 1 1 <none> 38m
NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
deployment.apps/kube-dns 1 1 1 1 38m
deployment.apps/kubernetes-dashboard 1 1 1 1 38m
NAME DESIRED CURRENT READY AGE
replicaset.apps/kube-dns-86f4d74b45 1 1 1 38m
replicaset.apps/kubernetes-dashboard-5498ccf677 1 1 1 38m
4.2. dashboard
通过访问ip:port,例如:http://172.16.94.139:30000/,可以访问k8s的dashboard控制台。
<img src="http://res.cloudinary.com/dqxtn0ick/image/upload/v1533695750/article/kubernetes/arch/dashboard.png" width = "100%"/>
5. troubleshooting
5.1. 没有安装VirtualBox
[root@minikube ~]# minikube start
Starting local Kubernetes v1.10.0 cluster...
Starting VM...
Downloading Minikube ISO
160.27 MB / 160.27 MB [============================================] 100.00% 0s
E0727 15:47:08.655647 9407 start.go:174] Error starting host: Error creating host: Error executing step: Running precreate checks.
: VBoxManage not found. Make sure VirtualBox is installed and VBoxManage is in the path.
Retrying.
E0727 15:47:08.656994 9407 start.go:180] Error starting host: Error creating host: Error executing step: Running precreate checks.
: VBoxManage not found. Make sure VirtualBox is installed and VBoxManage is in the path
================================================================================
An error has occurred. Would you like to opt in to sending anonymized crash
information to minikube to help prevent future errors?
To opt out of these messages, run the command:
minikube config set WantReportErrorPrompt false
================================================================================
Please enter your response [Y/n]:
解决方法,先安装VirtualBox。
5.2. 没有安装Docker
[root@minikube ~]# minikube start --vm-driver=none
Starting local Kubernetes v1.10.0 cluster...
Starting VM...
E0727 15:56:54.936706 9441 start.go:174] Error starting host: Error creating host: Error executing step: Running precreate checks.
: docker cannot be found on the path for this machine. A docker installation is a requirement for using the none driver: exec: "docker": executable file not found in $PATH.
Retrying.
E0727 15:56:54.938930 9441 start.go:180] Error starting host: Error creating host: Error executing step: Running precreate checks.
: docker cannot be found on the path for this machine. A docker installation is a requirement for using the none driver: exec: "docker": executable file not found in $PATH
解决方法,先安装Docker。
文章参考:
https://github.com/kubernetes/minikube
https://kubernetes.io/docs/setup/minikube/
2.1.4 - 使用kind安装kubernetes
1. 安装kind
On mac or linux
curl -Lo ./kind "https://github.com/kubernetes-sigs/kind/releases/download/v0.7.0/kind-$(uname)-amd64"
chmod +x ./kind
mv ./kind /some-dir-in-your-PATH/kind
2. 创建k8s集群
$ kind create cluster
Creating cluster "kind" ...
✓ Ensuring node image (kindest/node:v1.17.0) 🖼
✓ Preparing nodes 📦
✓ Writing configuration 📜
✓ Starting control-plane 🕹️
✓ Installing CNI 🔌
✓ Installing StorageClass 💾
Set kubectl context to "kind-kind"
You can now use your cluster with:
kubectl cluster-info --context kind-kind
Not sure what to do next? 😅 Check out https://kind.sigs.k8s.io/docs/user/quick-start/
查看集群信息
$ kubectl cluster-info --context kind-kind
Kubernetes master is running at https://127.0.0.1:32768
KubeDNS is running at https://127.0.0.1:32768/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy
查看node
$ kubectl get node -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
kind-control-plane Ready master 35h v1.17.0 172.17.0.2 <none> Ubuntu 19.10 3.10.107-1-tlinux2_kvm_guest-0049 containerd://1.3.2
查看pod
$ kubectl get po --all-namespaces -o wide
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
kube-system coredns-6955765f44-lqk9v 1/1 Running 0 35h 10.244.0.4 kind-control-plane <none> <none>
kube-system coredns-6955765f44-zpsmc 1/1 Running 0 35h 10.244.0.3 kind-control-plane <none> <none>
kube-system etcd-kind-control-plane 1/1 Running 0 35h 172.17.0.2 kind-control-plane <none> <none>
kube-system kindnet-8mt7d 1/1 Running 0 35h 172.17.0.2 kind-control-plane <none> <none>
kube-system kube-apiserver-kind-control-plane 1/1 Running 0 35h 172.17.0.2 kind-control-plane <none> <none>
kube-system kube-controller-manager-kind-control-plane 1/1 Running 0 35h 172.17.0.2 kind-control-plane <none> <none>
kube-system kube-proxy-5w25s 1/1 Running 0 35h 172.17.0.2 kind-control-plane <none> <none>
kube-system kube-scheduler-kind-control-plane 1/1 Running 0 35h 172.17.0.2 kind-control-plane <none> <none>
local-path-storage local-path-provisioner-7745554f7f-dckzr 1/1 Running 0 35h 10.244.0.2 kind-control-plane <none> <none>
docker ps
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
93b291f99dd4 kindest/node:v1.17.0 "/usr/local/bin/entr…" 2 minutes ago Up 2 minutes 127.0.0.1:32768->6443/tcp kind-control-plane
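kind默认创建单节点集群,如需多节点,可以编写集群配置文件后通过 --config 创建(以下为示意配置,apiVersion以所用kind版本的文档为准):
# kind-config.yaml:1个控制面节点 + 2个worker节点
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: worker
- role: worker
kind create cluster --config kind-config.yaml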
3. kindest/node容器内进程
$ docker exec -it 93b291f99dd4 bash
root@kind-control-plane:/# ps auxw
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.1 0.0 19512 7480 ? Ss 03:18 0:00 /sbin/init
root 105 0.0 0.0 26396 7344 ? S<s 03:18 0:00 /lib/systemd/systemd-journald
root 141 2.3 0.3 2374736 51564 ? Ssl 03:18 0:06 /usr/local/bin/containerd
root 325 0.0 0.0 112540 5036 ? Sl 03:18 0:00 /usr/local/bin/containerd-shim-runc-v2 -namespace k8s.io -id 3f415d609e15ef12b9f53557891c311c156b912d5a326544a25c8b29cfa9d366 -address /run/containerd/containerd.sock
root 346 0.0 0.0 1012 4 ? Ss 03:18 0:00 /pause
root 370 0.0 0.0 112540 5108 ? Sl 03:18 0:00 /usr/local/bin/containerd-shim-runc-v2 -namespace k8s.io -id 1e1f3eed09f701fb621325e7b9e96d1c3de60ebd3bd64e0aec376e9490cf0e57 -address /run/containerd/containerd.sock
root 397 0.0 0.0 112540 4684 ? Sl 03:18 0:00 /usr/local/bin/containerd-shim-runc-v2 -namespace k8s.io -id c5e451089a1a5b3dfb2cc68ee27ac7d414285be55ecfdc5bd59180fbfbc7df2e -address /run/containerd/containerd.sock
root 424 0.0 0.0 112540 4924 ? Sl 03:18 0:00 /usr/local/bin/containerd-shim-runc-v2 -namespace k8s.io -id 81e35f29ac8c2dda344125a10e3791be7ccf788a88f1efbc3397fa319f02881f -address /run/containerd/containerd.sock
root 443 0.0 0.0 1012 4 ? Ss 03:18 0:00 /pause
root 458 0.0 0.0 1012 4 ? Ss 03:18 0:00 /pause
root 465 0.0 0.0 1012 4 ? Ss 03:18 0:00 /pause
root 548 0.7 0.1 145500 27724 ? Ssl 03:18 0:02 kube-scheduler --authentication-kubeconfig=/etc/kubernetes/scheduler.conf --authorization-kubeconfig=/etc/kubernetes/scheduler.conf --bind-address=127.0.0.1 --kubeconfig=/etc/kubernetes/scheduler.conf --leader-elect=true
root 589 1.0 0.3 159536 54384 ? Ssl 03:18 0:02 kube-controller-manager --allocate-node-cidrs=true --authentication-kubeconfig=/etc/kubernetes/controller-manager.conf --authorization-kubeconfig=/etc/kubernetes/controller-manager.conf --bind-address=127.0.0.1 --client-ca-file=/etc/kubernetes/pki/ca.cr
root 613 3.8 1.6 445780 273484 ? Ssl 03:18 0:10 kube-apiserver --advertise-address=172.17.0.2 --allow-privileged=true --authorization-mode=Node,RBAC --client-ca-file=/etc/kubernetes/pki/ca.crt --enable-admission-plugins=NodeRestriction --enable-bootstrap-token-auth=true --etcd-cafile=/etc/kubernetes/
root 660 1.4 0.2 10613604 37448 ? Ssl 03:18 0:04 etcd --advertise-client-urls=https://172.17.0.2:2379 --cert-file=/etc/kubernetes/pki/etcd/server.crt --client-cert-auth=true --data-dir=/var/lib/etcd --initial-advertise-peer-urls=https://172.17.0.2:2380 --initial-cluster=kind-control-plane=https://172.
root 718 1.3 0.3 2084848 52772 ? Ssl 03:18 0:03 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --container-runtime=remote --container-runtime-endpoint=/run/containerd/containerd.sock --fail
root 876 0.0 0.0 112540 5084 ? Sl 03:18 0:00 /usr/local/bin/containerd-shim-runc-v2 -namespace k8s.io -id adfbea8fec5ac6986407291f5bfc5aecead176954e5dabbe1517b98dd77bf78b -address /run/containerd/containerd.sock
root 893 0.0 0.0 112540 4796 ? Sl 03:18 0:00 /usr/local/bin/containerd-shim-runc-v2 -namespace k8s.io -id 53bdce023626b60ffaa0548b5888e457dc9c3bc45c7808a385dd0f63dcc90327 -address /run/containerd/containerd.sock
root 924 0.0 0.0 1012 4 ? Ss 03:18 0:00 /pause
root 931 0.0 0.0 1012 4 ? Ss 03:18 0:00 /pause
root 1000 0.0 0.0 127616 11100 ? Ssl 03:18 0:00 /bin/kindnetd
root 1017 0.0 0.1 141060 19420 ? Ssl 03:18 0:00 /usr/local/bin/kube-proxy --config=/var/lib/kube-proxy/config.conf --hostname-override=kind-control-plane
root 1066 0.0 0.0 0 0 ? Z 03:18 0:00 [iptables-nft-sa] <defunct>
root 1080 0.0 0.0 0 0 ? Z 03:18 0:00 [iptables-nft-sa] <defunct>
root 1241 0.0 0.0 112540 5156 ? Sl 03:19 0:00 /usr/local/bin/containerd-shim-runc-v2 -namespace k8s.io -id 5cbd7bbe186cf5847786c7a03aa4c6f82e6c805d0a189f0f3e8fb1750594260d -address /run/containerd/containerd.sock
root 1262 0.0 0.0 1012 4 ? Ss 03:19 0:00 /pause
root 1303 0.1 0.0 134372 14088 ? Ssl 03:19 0:00 local-path-provisioner --debug start --helper-image k8s.gcr.io/debian-base:v2.0.0 --config /etc/config/config.json
root 1411 0.0 0.0 112540 4876 ? Sl 03:19 0:00 /usr/local/bin/containerd-shim-runc-v2 -namespace k8s.io -id 196b440345cb5ef47a6c31222323d35bbfef85d1d79c149ec0e3a6e22022a5f0 -address /run/containerd/containerd.sock
root 1437 0.0 0.0 1012 4 ? Ss 03:19 0:00 /pause
root 1450 0.0 0.0 112540 4380 ? Sl 03:19 0:00 /usr/local/bin/containerd-shim-runc-v2 -namespace k8s.io -id de7bdf052083978c78708383f842567d4fb38adff22a56792437a4de82425afe -address /run/containerd/containerd.sock
root 1480 0.0 0.0 1012 4 ? Ss 03:19 0:00 /pause
root 1530 0.1 0.1 144324 19056 ? Ssl 03:19 0:00 /coredns -conf /etc/coredns/Corefile
root 1531 0.1 0.1 144580 19204 ? Ssl 03:19 0:00 /coredns -conf /etc/coredns/Corefile
参考:
2.2 - k8s证书及秘钥
1. 证书分类
- 服务器证书:server cert,用于客户端验证服务端的身份。
- 客户端证书:client cert,用于服务端验证客户端的身份。
- 对等证书:peer cert(既是server cert又是client cert),用于成员之间的身份验证,例如 etcd。
1.1. k8s集群的证书分类
- etcd节点:需要标识自己服务的server cert,也需要client cert与etcd集群其他节点交互,因此需要一个对等证书。
- master节点:需要标识apiserver服务的server cert,也需要client cert连接etcd集群,也需要一个对等证书。
- kubelet:需要标识自己服务的server cert,也需要client cert请求apiserver,也使用一个对等证书。
- kubectl、kube-proxy、calico:需要client证书。
2. CA证书及秘钥
目录:/etc/kubernetes/ssl
| 分类 | 证书/秘钥 | 说明 | 组件 |
|---|---|---|---|
| ca | ca-key.pem | | |
| ca | ca.pem | | |
| ca | ca.csr | | |
| Kubernetes | kubernetes-key.pem | | |
| Kubernetes | kubernetes.pem | | |
| Kubernetes | kubernetes.csr | | |
| Admin | admin-key.pem | | |
| Admin | admin.pem | | |
| Admin | admin.csr | | |
| Kubelet | kubelet.crt | | |
| Kubelet | kubelet.key | | |
配置文件
| 分类 | 证书/秘钥 | 说明 |
|---|---|---|
| ca | ca-config.json | |
| ca | ca-csr.json | |
| Kubernetes | kubernetes-csr.json | |
| Admin | admin-csr.json | |
| Kube-proxy | kube-proxy-csr.json | |
3. cfssl工具
安装cfssl:
# 下载cfssl
$ curl https://pkg.cfssl.org/R1.2/cfssl_linux-amd64 -o /usr/local/bin/cfssl
$ curl https://pkg.cfssl.org/R1.2/cfssljson_linux-amd64 -o /usr/local/bin/cfssljson
$ curl https://pkg.cfssl.org/R1.2/cfssl-certinfo_linux-amd64 -o /usr/local/bin/cfssl-certinfo
# 添加可执行权限
$ chmod +x /usr/local/bin/cfssl /usr/local/bin/cfssljson /usr/local/bin/cfssl-certinfo
4. 创建 CA (Certificate Authority)
4.1. 配置源文件
创建 CA 配置文件
ca-config.json
cat << EOF > ca-config.json
{
"signing": {
"default": {
"expiry": "87600h"
},
"profiles": {
"kubernetes": {
"usages": [
"signing",
"key encipherment",
"server auth",
"client auth"
],
"expiry": "876000h"
}
}
}
}
EOF
参数说明
- ca-config.json:可以定义多个 profiles,分别指定不同的过期时间、使用场景等参数;后续在签名证书时使用某个 profile;
- signing:表示该证书可用于签名其它证书;生成的 ca.pem 证书中CA=TRUE;
- server auth:表示client可以用该 CA 对server提供的证书进行验证;
- client auth:表示server可以用该CA对client提供的证书进行验证;
创建 CA 证书签名请求
ca-csr.json
cat << EOF > ca-csr.json
{
"CN": "kubernetes",
"key": {
"algo": "rsa",
"size": 2048
},
"names": [
{
"C": "CN",
"ST": "ShenZhen",
"L": "ShenZhen",
"O": "k8s",
"OU": "System"
}
]
}
EOF
参数说明
ca-csr.json的参数
- CN:Common Name,kube-apiserver 从证书中提取该字段作为请求的用户名 (User Name);浏览器使用该字段验证网站是否合法;
names中的字段:
- C : country,国家
- ST: state,州或省份
- L:location,城市
- O:organization,组织,kube-apiserver 从证书中提取该字段作为请求用户所属的组 (Group)
- OU:organization unit
4.2. 执行命令
cfssl gencert -initca ca-csr.json | cfssljson -bare ca
输出如下:
# cfssl gencert -initca ca-csr.json | cfssljson -bare ca
2019/12/13 14:35:52 [INFO] generating a new CA key and certificate from CSR
2019/12/13 14:35:52 [INFO] generate received request
2019/12/13 14:35:52 [INFO] received CSR
2019/12/13 14:35:52 [INFO] generating key: rsa-2048
2019/12/13 14:35:52 [INFO] encoded CSR
2019/12/13 14:35:52 [INFO] signed certificate with serial number 248379771349454958117219047414671162179070747780
生成以下文件:
# 生成文件
-rw-r--r-- 1 root root 1005 12月 13 11:32 ca.csr
-rw------- 1 root root 1675 12月 13 11:32 ca-key.pem
-rw-r--r-- 1 root root 1363 12月 13 11:32 ca.pem
# 配置源文件
-rw-r--r-- 1 root root 293 12月 13 11:31 ca-config.json
-rw-r--r-- 1 root root 210 12月 13 11:31 ca-csr.json
5. 创建 kubernetes 证书
5.1. 配置源文件
创建 kubernetes 证书签名请求文件kubernetes-csr.json。
cat << EOF > kubernetes-csr.json
{
"CN": "kubernetes",
"hosts": [
"127.0.0.1",
"<MASTER_IP>",
"<MASTER_CLUSTER_IP>",
"kubernetes",
"kubernetes.default",
"kubernetes.default.svc",
"kubernetes.default.svc.cluster",
"kubernetes.default.svc.cluster.local"
],
"key": {
"algo": "rsa",
"size": 2048
},
"names": [{
"C": "<country>",
"ST": "<state>",
"L": "<city>",
"O": "<organization>",
"OU": "<organization unit>"
}]
}
EOF
参数说明:
- MASTER_IP:master节点的IP或域名。
- MASTER_CLUSTER_IP:kube-apiserver指定的service-cluster-ip-range网段的第一个IP,例如 10.254.0.1。
5.2. 执行命令
$ cfssl gencert -ca=ca.pem -ca-key=ca-key.pem -config=ca-config.json -profile=kubernetes kubernetes-csr.json | cfssljson -bare kubernetes
输出如下:
# cfssl gencert -ca=ca.pem -ca-key=ca-key.pem -config=ca-config.json -profile=kubernetes kubernetes-csr.json | cfssljson -bare kubernetes
2019/12/13 14:40:28 [INFO] generate received request
2019/12/13 14:40:28 [INFO] received CSR
2019/12/13 14:40:28 [INFO] generating key: rsa-2048
2019/12/13 14:40:28 [INFO] encoded CSR
2019/12/13 14:40:28 [INFO] signed certificate with serial number 392795299385191732458211386861696542628305189374
2019/12/13 14:40:28 [WARNING] This certificate lacks a "hosts" field. This makes it unsuitable for
websites. For more information see the Baseline Requirements for the Issuance and Management
of Publicly-Trusted Certificates, v.1.1.6, from the CA/Browser Forum (https://cabforum.org);
specifically, section 10.2.3 ("Information Requirements").
生成以下文件:
# 生成文件
-rw-r--r-- 1 root root 1269 12月 13 14:40 kubernetes.csr
-rw------- 1 root root 1679 12月 13 14:40 kubernetes-key.pem
-rw-r--r-- 1 root root 1643 12月 13 14:40 kubernetes.pem
# 配置源文件
-rw-r--r-- 1 root root 580 12月 13 14:40 kubernetes-csr.json
6. 创建 admin 证书
6.1. 配置源文件
创建 admin 证书签名请求文件 admin-csr.json:
cat << EOF > admin-csr.json
{
"CN": "admin",
"hosts": [],
"key": {
"algo": "rsa",
"size": 2048
},
"names": [
{
"C": "CN",
"ST": "ShenZhen",
"L": "ShenZhen",
"O": "system:masters",
"OU": "System"
}
]
}
EOF
6.2. 执行命令
$ cfssl gencert -ca=ca.pem -ca-key=ca-key.pem -config=ca-config.json -profile=kubernetes admin-csr.json | cfssljson -bare admin
输出如下:
# cfssl gencert -ca=ca.pem -ca-key=ca-key.pem -config=ca-config.json -profile=kubernetes admin-csr.json | cfssljson -bare admin
2019/12/13 14:52:37 [INFO] generate received request
2019/12/13 14:52:37 [INFO] received CSR
2019/12/13 14:52:37 [INFO] generating key: rsa-2048
2019/12/13 14:52:37 [INFO] encoded CSR
2019/12/13 14:52:37 [INFO] signed certificate with serial number 465422983473444224050765004141217688748259757371
2019/12/13 14:52:37 [WARNING] This certificate lacks a "hosts" field. This makes it unsuitable for
websites. For more information see the Baseline Requirements for the Issuance and Management
of Publicly-Trusted Certificates, v.1.1.6, from the CA/Browser Forum (https://cabforum.org);
specifically, section 10.2.3 ("Information Requirements").
生成文件
# 生成文件
-rw-r--r-- 1 root root 1013 12月 13 14:52 admin.csr
-rw------- 1 root root 1675 12月 13 14:52 admin-key.pem
-rw-r--r-- 1 root root 1407 12月 13 14:52 admin.pem
# 配置源文件
-rw-r--r-- 1 root root 231 12月 13 14:49 admin-csr.json
7. 创建 kube-proxy 证书
7.1. 配置源文件
创建 kube-proxy 证书签名请求文件 kube-proxy-csr.json:
cat << EOF > kube-proxy-csr.json
{
"CN": "system:kube-proxy",
"hosts": [],
"key": {
"algo": "rsa",
"size": 2048
},
"names": [
{
"C": "CN",
"ST": "BeiJing",
"L": "BeiJing",
"O": "k8s",
"OU": "System"
}
]
}
EOF
7.2. 执行命令
$ cfssl gencert -ca=ca.pem -ca-key=ca-key.pem -config=ca-config.json -profile=kubernetes kube-proxy-csr.json | cfssljson -bare kube-proxy
输出如下:
# cfssl gencert -ca=ca.pem -ca-key=ca-key.pem -config=ca-config.json -profile=kubernetes kube-proxy-csr.json | cfssljson -bare kube-proxy
2019/12/13 19:37:48 [INFO] generate received request
2019/12/13 19:37:48 [INFO] received CSR
2019/12/13 19:37:48 [INFO] generating key: rsa-2048
2019/12/13 19:37:48 [INFO] encoded CSR
2019/12/13 19:37:48 [INFO] signed certificate with serial number 526712749765692443642491255093816136154324531741
2019/12/13 19:37:48 [WARNING] This certificate lacks a "hosts" field. This makes it unsuitable for
websites. For more information see the Baseline Requirements for the Issuance and Management
of Publicly-Trusted Certificates, v.1.1.6, from the CA/Browser Forum (https://cabforum.org);
specifically, section 10.2.3 ("Information Requirements").
生成文件:
# 生成文件
-rw-r--r-- 1 root root 1009 12月 13 19:37 kube-proxy.csr
-rw------- 1 root root 1675 12月 13 19:37 kube-proxy-key.pem
-rw-r--r-- 1 root root 1407 12月 13 19:37 kube-proxy.pem
# 配置源文件
-rw-r--r-- 1 root root 230 12月 13 19:37 kube-proxy-csr.json
8. 校验证书
openssl x509 -noout -text -in kubernetes.pem
输出如下:
# openssl x509 -noout -text -in kubernetes.pem
Certificate:
Data:
Version: 3 (0x2)
Serial Number:
44:cd:8c:e6:a4:60:ff:3f:09:af:02:e7:68:5e:f2:0f:e6:a0:39:fe
Signature Algorithm: sha256WithRSAEncryption
Issuer: C=CN, ST=ShenZhen, L=ShenZhen, O=k8s, OU=System, CN=kubernetes
Validity
Not Before: Dec 13 06:35:00 2019 GMT
Not After : Nov 19 06:35:00 2119 GMT
Subject: C=CN, ST=ShenZhen, L=ShenZhen, O=k8s, OU=System, CN=kubernetes
Subject Public Key Info:
Public Key Algorithm: rsaEncryption
Public-Key: (2048 bit)
Modulus:
00:d7:91:4f:90:56:fb:ab:a9:de:c4:98:9e:d7:e6:
45:db:5a:14:9a:76:78:6a:4c:db:3c:47:3c:e7:1c:
3c:37:4e:8a:cf:9c:a1:8a:4c:51:4c:cd:45:b0:03:
87:06:b9:20:2c:3a:51:f9:21:55:1c:90:7c:f8:93:
bc:6a:48:05:3d:8b:74:fd:f2:f1:e6:5e:ad:b4:a8:
f6:6d:f9:63:9e:e4:b4:cc:68:9e:90:d7:ef:de:ce:
c1:1d:1b:68:59:68:5e:5f:7d:5c:f3:49:4f:18:72:
be:b5:c8:af:e2:8d:34:9c:d2:68:b7:8c:26:69:cc:
a5:f4:ca:69:2d:d7:21:f5:19:2e:b2:b5:97:16:87:
9f:9c:fd:01:97:c2:0e:20:b4:88:27:9a:37:9a:af:
0a:cf:82:4f:26:24:cb:07:ac:8c:b1:34:20:42:22:
00:b2:b0:98:c5:53:01:fb:32:aa:15:1b:7e:39:44:
ae:af:6e:c3:65:96:f6:38:7a:87:37:d0:31:63:d8:
a4:15:13:f2:56:da:e6:09:45:2b:46:2c:cb:63:db:
f7:ba:7f:44:0a:36:39:7c:cc:5b:42:e5:56:c7:7f:
dd:64:5c:f2:4a:af:d3:a9:d1:6e:06:27:57:09:4d:
db:08:62:87:66:c8:2c:36:00:41:f1:90:f6:5f:68:
20:3d
Exponent: 65537 (0x10001)
X509v3 extensions:
X509v3 Key Usage: critical
Digital Signature, Key Encipherment
X509v3 Extended Key Usage:
TLS Web Server Authentication, TLS Web Client Authentication
X509v3 Basic Constraints: critical
CA:FALSE
X509v3 Subject Key Identifier:
3D:3F:FA:B8:36:D7:FE:B1:59:BE:B1:F5:C1:5D:88:3D:BC:78:9F:87
X509v3 Authority Key Identifier:
keyid:40:A2:D4:30:22:12:2E:C2:FB:A2:55:2C:CB:F0:F6:3E:4D:B8:02:03
X509v3 Subject Alternative Name:
DNS:kubernetes, DNS:kubernetes.default, DNS:kubernetes.default.svc, DNS:kubernetes.default.svc.cluster, DNS:kubernetes.default.svc.cluster.local, IP Address:127.0.0.1, IP Address:172.20.0.112, IP Address:172.20.0.113, IP Address:172.20.0.114, IP Address:172.20.0.115, IP Address:10.254.0.1
Signature Algorithm: sha256WithRSAEncryption
63:50:f6:2a:03:c7:35:dd:e9:10:8d:2f:b3:27:9a:64:f3:e1:
11:8a:18:1e:fa:6d:85:30:11:b4:59:a3:6c:86:cd:2b:5c:67:
17:4f:aa:0d:bb:4c:ee:c8:af:e7:3d:61:6d:03:9d:14:6f:00:
48:56:59:b5:76:13:a9:30:23:e0:b2:d2:12:64:0c:60:0d:76:
ec:c6:4f:b1:bc:24:01:7a:48:c6:fd:9e:5d:68:da:b9:a1:ad:
30:7a:ba:90:e2:e3:4e:b4:92:1b:c5:f2:8c:c1:b0:3d:fc:14:
d2:46:e8:f8:22:8f:d9:4d:85:4f:58:6b:0f:84:78:06:b4:b9:
92:b9:0d:bd:1d:95:e9:0d:42:d3:fd:dd:2a:59:60:3f:63:35:
eb:07:25:d2:ea:0d:19:a6:f3:dc:92:8e:ee:73:04:15:5e:97:
e8:da:51:c3:69:49:96:36:c7:cc:5b:e5:e5:cb:e5:ce:9f:21:
6f:6b:56:16:bf:85:ad:1c:8c:91:c1:91:0a:90:18:e2:4a:b0:
32:58:33:ef:55:8e:8f:4a:e3:0f:b8:f7:41:04:65:89:e1:1b:
d8:41:28:6e:84:c3:1c:8e:a9:a0:8a:42:e4:fe:d7:fe:0e:24:
dc:74:37:fa:5e:be:20:69:c5:9a:5a:e6:83:1c:0b:9e:e1:43:
ef:4f:7a:37
字段说明:
- 确认Issuer字段的内容和ca-csr.json一致;
- 确认Subject字段的内容和kubernetes-csr.json一致;
- 确认X509v3 Subject Alternative Name字段的内容和kubernetes-csr.json一致;
- 确认X509v3 Key Usage、Extended Key Usage字段的内容和ca-config.json中kubernetes profile一致;
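可以用下面的命令快速核对上述字段(仅为示意,假设证书文件位于当前目录):
# 查看签发者(Issuer)与使用者(Subject)信息
openssl x509 -noout -subject -issuer -in kubernetes.pem
# 查看SAN(Subject Alternative Name)列表
openssl x509 -noout -text -in kubernetes.pem | grep -A1 "Subject Alternative Name"
# 使用cfssl-certinfo以JSON格式查看证书详情
cfssl-certinfo -cert kubernetes.pem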
9. 分发证书
将生成的证书和秘钥文件(后缀名为.pem)拷贝到所有机器的 /etc/kubernetes/ssl 目录下。
mkdir -p /etc/kubernetes/ssl
cp *.pem /etc/kubernetes/ssl
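如果集群包含多台机器,可以参考下面的方式批量分发(仅为示意,节点名 node-1、node-2、node-3 为假设值,需按实际环境替换):
# 批量创建目录并分发证书
for node in node-1 node-2 node-3; do
  ssh root@${node} "mkdir -p /etc/kubernetes/ssl"
  scp *.pem root@${node}:/etc/kubernetes/ssl/
done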
参考文章:
2.3 - k8s版本说明
1. k8s版本号说明
k8s维护最新三个版本的发布分支([2022.7.2]当前最新三个版本为1.24、1.23、1.22),Kubernetes 1.19 和更新的版本获得大约 1 年的补丁支持。
Kubernetes 版本表示为 x.y.z, 其中 x 是主要版本,y 是次要版本,z 是补丁版本。遵循语义化版本规范。
2. 最新发行版本
1.24
最新发行版本:1.24.2 (发布日期: 2022-06-15)
不再支持:2023-09-29
Complete 1.24 Schedule and Changelog
Kubernetes 1.24 使用 go1.18构建,默认情况下将不再验证使用 SHA-1 哈希算法签名的证书。
2.1.1. 重要更新
1.24.0主要参考kubernetes/CHANGELOG-1.24.major-themes
1)kubelet完全移除Dockershim【最重大更新】
在 v1.20 中弃用后,dockershim 组件已从 kubelet 中删除。从 v1.24 开始,您将需要使用其他受支持的运行时之一(例如 containerd 或 CRI-O),或者如果您依赖 Docker 引擎作为容器运行时,则使用 cri-dockerd。有关确保您的集群已准备好进行此移除的更多信息,请参阅指南《Is Your Cluster Ready for v1.24?》。
2)Beta API 默认关闭
默认情况下,不会在集群中启用新的 beta API。默认情况下,现有的 beta API 和现有 beta API 的新版本将继续启用。
3)存储容量和卷扩展到GA
存储容量跟踪支持通过 CSIStorageCapacity 对象公开当前可用的存储容量,并增强使用具有后期绑定的 CSI 卷的 pod 的调度。
卷扩展增加了对调整现有持久卷大小的支持。
4)避免 IP 分配给service的冲突
Kubernetes 1.24 引入了一项新的选择加入(opt-in)功能,允许您为 Service 的静态 IP 地址软性预留一段范围。手动启用此功能后,集群在自动分配 Service IP 时将倾向于避开为静态分配预留的地址段,从而降低冲突风险。
可以分配 Service ClusterIP:
- 动态分配:集群将自动在配置的服务 IP 范围内选择一个空闲 IP。
- 静态分配:用户在配置的服务 IP 范围内自行指定一个 IP。
Service ClusterIP 是唯一的,因此,尝试使用已分配的 ClusterIP 创建 Service 将返回错误。
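下面是一个静态指定 ClusterIP 的 Service 示意(名称与 IP 均为假设值,IP 必须位于集群的 service-cluster-ip-range 内且未被占用):
cat << EOF > my-static-ip-svc.yaml
apiVersion: v1
kind: Service
metadata:
  name: my-static-ip-svc
spec:
  clusterIP: 10.96.0.100    # 假设值,需在 service-cluster-ip-range 内
  selector:
    app: my-app
  ports:
  - port: 80
    targetPort: 8080
EOF
kubectl apply -f my-static-ip-svc.yaml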
2.1.2. 弃用更新
1)kubeadm
- kubeadm.k8s.io/v1beta2 已被弃用,并将在未来的版本中删除(大约在 3 个版本、约一年之后)。新集群应开始使用 kubeadm.k8s.io/v1beta3;要迁移磁盘上的旧配置文件,可以使用 kubeadm config migrate 命令。
- kubeadm 的默认 CRI 套接字改为 containerd(Unix:unix:///var/run/containerd/containerd.sock,Windows:npipe:////./pipe/containerd-containerd),而不再是 Docker。如果在集群创建期间 Init|JoinConfiguration.nodeRegistration.criSocket 字段为空,且在主机上发现多个套接字,则总是会抛出错误并要求用户显式指定要使用的套接字。与 CRI 套接字的所有通信均使用 crictl 完成(例如拉取镜像、获取正在运行的容器列表),而不再像使用 Docker 时那样调用 docker CLI。
- kubeadm 的标签和污点中不再使用 master 一词。对于新的集群,控制平面节点将不再添加标签 node-role.kubernetes.io/master,只会添加标签 node-role.kubernetes.io/control-plane。
2)kube-apiserver
- 不安全的地址标志 --address、--insecure-bind-address、--port 和 --insecure-port(自 1.20 起即为无效参数)被删除。
- 弃用了 --master-count 标志和 --endpoint-reconciler-type=master-count reconciler,转而使用 lease reconciler。
- 已弃用 Service.Spec.LoadBalancerIP。
3)kube-controller-manager
- kube-controller-manager 中的不安全地址标志 --address 和 --port 自 v1.20 起无效,并在 v1.24 中被删除。
4)kubelet
- --pod-infra-container-image kubelet 标志已弃用,将在未来版本中删除。
- 以下与 dockershim 相关的标志也随 dockershim 一起被删除:--experimental-dockershim-root-directory、--docker-endpoint、--image-pull-progress-deadline、--network-plugin、--cni-conf-dir、--cni-bin-dir、--cni-cache-dir、--network-plugin-mtu。(#106907, @cyclinder)
1.23
最新发行版本:1.23.8 (发布日期: 2022-06-15)
不再支持:2023-02-28
补丁版本: 1.23.1、 1.23.2、 1.23.3、 1.23.4、 1.23.5、 1.23.6、 1.23.7、 1.23.8
Complete 1.23 Schedule and Changelog
Kubernetes 1.23 使用 golang 1.17 构建。该版本的 go 移除了通过 GODEBUG=x509ignoreCN=0 环境变量重新启用“将 X.509 服务证书的 CommonName 当作主机名”这一已弃用旧行为的能力。
2.2.1. 重要更新
1)FlexVolume 已弃用
FlexVolume 已弃用。 Out-of-tree CSI 驱动程序是在 Kubernetes 中编写卷驱动程序的推荐方式。FlexVolume 驱动程序的维护者应实施 CSI 驱动程序并将 FlexVolume 的用户转移到 CSI。 FlexVolume 的用户应将其工作负载转移到 CSI 驱动程序。
2)IPv4/IPv6 双栈网络到 GA
IPv4/IPv6 双栈网络毕业进入 GA。从 1.21 开始,Kubernetes 集群默认启用双栈网络支持;在 1.23 中,移除了 IPv6DualStack 功能门控。双栈网络的使用不是强制性的:尽管集群启用了双栈支持,Pod 和 Service 仍默认是单栈的。要使用双栈网络,需要满足:Kubernetes 节点具有可路由的 IPv4/IPv6 网络接口,使用支持双栈的 CNI 网络插件,将 Pod 配置为双栈,并将 Service 的 .spec.ipFamilyPolicy 字段设置为 PreferDualStack 或 RequireDualStack。
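下面是一个声明双栈策略的 Service 片段示意(前提是集群已按上述条件支持双栈,名称为假设值):
cat << EOF > dualstack-svc.yaml
apiVersion: v1
kind: Service
metadata:
  name: my-dualstack-svc
spec:
  ipFamilyPolicy: PreferDualStack    # 也可以设置为 RequireDualStack
  selector:
    app: my-app
  ports:
  - port: 80
EOF
kubectl apply -f dualstack-svc.yaml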
3)HorizontalPodAutoscaler v2 到 GA
HorizontalPodAutoscaler API 的第 2 版在 1.23 版本中逐渐稳定。 HorizontalPodAutoscaler autoscaling/v2beta2 API 已弃用,取而代之的是新的 autoscaling/v2 API,Kubernetes 项目建议将其用于所有用例。
4)Scheduler简化多点插件配置
kube-scheduler 正在为插件添加一个新的、简化的配置字段,以允许在一个位置启用多个扩展点。新的 multiPoint 插件字段旨在为管理员简化大多数调度程序设置。通过 multiPoint 启用的插件将自动为它们实现的每个单独的扩展点注册。例如,实现 Score 和 Filter 扩展的插件可以同时为两者启用。这意味着可以启用和禁用整个插件,而无需手动编辑单个扩展点设置。这些扩展点现在可以被抽象出来,因为它们与大多数用户无关。
2.2.2. 已知问题
在 1.22 Kubernetes 版本附带的 etcd v3.5.0 版本中发现了数据损坏问题,请阅读 etcd 的最新生产建议(见 etcd-io/etcd 仓库的 CHANGELOG)。
在高负载下运行 etcd v3.5.0、v3.5.1、v3.5.2 可能导致数据损坏:如果 etcd 进程被杀,偶尔会出现部分已提交事务未同步到所有成员的情况,建议升级到 v3.5.3。
生产环境推荐的最低 etcd 版本为 3.3.18+、3.4.2+、v3.5.3+。
1.22
最新发行版本:1.22.11 (发布日期: 2022-06-15)
不再支持:2022-10-28
补丁版本: 1.22.1、 1.22.2、 1.22.3、 1.22.4、 1.22.5、 1.22.6、 1.22.7、 1.22.8、 1.22.9、 1.22.10、 1.22.11
Complete 1.22 Schedule and Changelog
2.3.1. 重要更新
1)kubeadm
- 允许非root用户运行kubeadm。
- v1beta3 现在是首选的 API 版本;v1beta2 API 仍然可用,尚未弃用。
- 移除对docker cgroup driver的检查,kubeadm默认使用systemd cgroup driver,需要手动将runtime配置为systemd。
- v1beta3中删除ClusterConfiguration.DNS字段,因为CoreDNS是唯一支持的DNS类型。
2)etcd
- etcd使用v3.5.0版本。(但是在1.23版本中发现v3.5.0有数据损坏的问题)
3)kubelet
- 节点支持swap内存。
- 作为Alpha特性,Kubernetes v1.22可以使用cgroup v2 API来控制内存分配和隔离。该功能旨在改善内存资源存在争用时工作负载和节点的可用性。
3. 版本偏差策略
3.1. 支持的版本偏差
总结:
- kubelet版本不能比kube-apiserver版本新,最多只可落后两个次要版本。
- kube-controller-manager、kube-scheduler和cloud-controller-manager不能比kube-apiserver版本新,最多落后一个次要版本(允许实时升级)。
- kubectl与kube-apiserver的版本偏差最多为一个次要版本(可较旧或较新)。
- kube-proxy和节点上的kubelet必须是相同的次要版本。
1)kube-apiserver
在高可用性(HA)集群中, 最新版和最老版的 kube-apiserver 实例版本偏差最多为一个次要版本。
例如:
- 最新的kube-apiserver实例处于 1.24 版本
- 其他kube-apiserver实例支持 1.24 和 1.23 版本
2)kubelet
kubelet 版本不能比 kube-apiserver 版本新,并且最多只可落后两个次要版本。
例如:
- kube-apiserver处于 1.24 版本
- kubelet支持 1.24、1.23 和 1.22 版本
说明:
如果 HA 集群中的 kube-apiserver 实例之间存在版本偏差,这会缩小允许的 kubelet 版本范围。
例如:
- kube-apiserver实例处于 1.24 和 1.23 版本
- kubelet支持 1.23 和 1.22 版本(不支持 1.24 版本,因为这将比 1.23 版本的kube-apiserver实例新)
3)kube-controller-manager、kube-scheduler 和 cloud-controller-manager
kube-controller-manager、kube-scheduler 和 cloud-controller-manager 不能比与它们通信的 kube-apiserver 实例新。 它们应该与 kube-apiserver 次要版本相匹配,但可能最多旧一个次要版本(允许实时升级)。
例如:
- kube-apiserver处于 1.24 版本
- kube-controller-manager、kube-scheduler和cloud-controller-manager支持 1.24 和 1.23 版本
说明:
如果 HA 集群中的 kube-apiserver 实例之间存在版本偏差, 并且这些组件可以与集群中的任何 kube-apiserver 实例通信(例如,通过负载均衡器),这会缩小这些组件所允许的版本范围。
例如:
- kube-apiserver实例处于 1.24 和 1.23 版本
- kube-controller-manager、kube-scheduler和cloud-controller-manager与可以路由到任何kube-apiserver实例的负载均衡器通信
- kube-controller-manager、kube-scheduler和cloud-controller-manager支持 1.23 版本(不支持 1.24 版本,因为它比 1.23 版本的kube-apiserver实例新)
4)kubectl
kubectl 在 kube-apiserver 的一个次要版本(较旧或较新)中支持。
例如:
- kube-apiserver处于 1.24 版本
- kubectl支持 1.25、1.24 和 1.23 版本
说明:
如果 HA 集群中的 kube-apiserver 实例之间存在版本偏差,这会缩小支持的 kubectl 版本范围。
例如:
- kube-apiserver实例处于 1.24 和 1.23 版本
- kubectl支持 1.24 和 1.23 版本(其他版本将与某个kube-apiserver实例相差不止一个次要版本)
5)kube-proxy
kube-proxy和节点上的kubelet必须是相同的次要版本。kube-proxy版本不能比kube-apiserver版本新。kube-proxy最多只能比kube-apiserver落后两个次要版本。
例如:
如果 kube-proxy 版本处于 1.22 版本:
- kubelet必须处于相同的次要版本 1.22。
- kube-apiserver版本必须介于 1.22 和 1.24 之间(包括两者)。
3.2. 组件升级顺序
优先升级kube-apiserver,其他的组件按照上述的版本要求进行升级,最好保持一致的版本。
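升级前可以先确认当前各组件版本是否满足上述偏差要求,示意命令如下:
# 查看 kubectl 客户端与 kube-apiserver 的版本
kubectl version --short
# 查看各节点 kubelet 的版本(VERSION 列)
kubectl get nodes -o wide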
4. k8s版本发布周期
k8s每年大概发布三次,即3-4个月发布一次大版本(发布版本为 vX.Y 里程碑创建的 Git 分支 release-X.Y)。
发布过程可被认为具有三个主要阶段:
- 特性增强定义
- 实现
- 稳定
4.1. 发布周期
1)正常开发(第 1-11 周)
- /sig {name}
- /kind {type}
- /lgtm
- /approved
2)代码冻结(第 12-14 周)
- /milestone {v1.y}
- /sig {name}
- /kind {bug, failing-test}
- /lgtm
- /approved
3)发布后(第 14 周以上)
回到“正常开发”阶段要求:
- /sig {name}
- /kind {type}
- /lgtm
- /approved
参考:
3 - 基本概念
3.1 - kubernetes架构
3.1.1 - Kubernetes总架构图
1. Kubernetes的总架构图
2. Kubernetes各个组件介绍
2.1 kube-master[控制节点]
master的工作流程图
- Kubecfg将特定的请求,比如创建Pod,发送给Kubernetes Client。
- Kubernetes Client将请求发送给API server。
- API Server根据请求的类型,比如创建Pod时storage类型是pods,然后依此选择何种REST Storage API对请求作出处理。
- REST Storage API对请求作出相应的处理。
- 将处理的结果存入高可用键值存储系统Etcd中。
- 在API Server响应Kubecfg的请求后,Scheduler会根据Kubernetes Client获取集群中运行Pod及Minion/Node信息。
- 依据从Kubernetes Client获取的信息,Scheduler将未分发的Pod分发到可用的Minion/Node节点上。
2.1.1 API Server[资源操作入口]
-
提供了资源对象的唯一操作入口,其他所有组件都必须通过它提供的API来操作资源数据,只有API Server与存储通信,其他模块通过API Server访问集群状态。
第一,是为了保证集群状态访问的安全。
第二,是为了隔离集群状态访问的方式和后端存储实现的方式:API Server是状态访问的方式,不会因为后端存储技术etcd的改变而改变。
-
作为kubernetes系统的入口,封装了核心对象的增删改查操作,以RESTFul接口方式提供给外部客户和内部组件调用。对相关的资源数据“全量查询”+“变化监听”,实时完成相关的业务功能。
更多API Server信息请参考:Kubernetes核心原理(一)之API Server
2.1.2 Controller Manager[内部管理控制中心]
- 实现集群故障检测和恢复的自动化工作,负责执行各种控制器,主要有:
- endpoint-controller:定期关联service和pod(关联信息由endpoint对象维护),保证service到pod的映射总是最新的。
- replication-controller:定期关联replicationController和pod,保证replicationController定义的复制数量与实际运行pod的数量总是一致的。
更多Controller Manager信息请参考:Kubernetes核心原理(二)之Controller Manager
2.1.3 Scheduler[集群分发调度器]
- Scheduler收集和分析当前Kubernetes集群中所有Minion/Node节点的资源(内存、CPU)负载情况,然后依此分发新建的Pod到Kubernetes集群中可用的节点。
- 实时监测Kubernetes集群中未分发和已分发的所有运行的Pod。
- Scheduler也监测Minion/Node节点信息,由于会频繁查找Minion/Node节点,Scheduler会缓存一份最新的信息在本地。
- 最后,Scheduler在分发Pod到指定的Minion/Node节点后,会把Pod相关的信息Binding写回API Server。
更多Scheduler信息请参考:Kubernetes核心原理(三)之Scheduler
2.2 kube-node[服务节点]
kubelet结构图
2.2.1 Kubelet[节点上的Pod管家]
- 负责Node节点上pod的创建、修改、监控、删除等全生命周期的管理。
- 定时上报本Node的状态信息给API Server。
- kubelet是Master API Server和Minion/Node之间的桥梁,接收Master API Server分配给它的commands和work,通过kube-apiserver间接与Etcd集群交互,读取配置信息。
- 具体的工作如下:
  - 设置容器的环境变量、给容器绑定Volume、给容器绑定Port、根据指定的Pod运行一个单一容器、给指定的Pod创建network容器。
  - 同步Pod的状态,从cAdvisor获取container info、pod info、root info、machine info。
  - 在容器中运行命令、杀死容器、删除Pod的所有容器。

更多Kubelet信息请参考:Kubernetes核心原理(四)之Kubelet
2.2.2 Proxy[负载均衡、路由转发]
- Proxy是为了解决外部网络能够访问跨机器集群中容器提供的应用服务而设计的,运行在每个Minion/Node上。Proxy提供TCP/UDP sockets的proxy,每创建一种Service,Proxy主要从etcd获取Services和Endpoints的配置信息(也可以从file获取),然后根据配置信息在Minion/Node上启动一个Proxy的进程并监听相应的服务端口,当外部请求发生时,Proxy会根据Load Balancer将请求分发到后端正确的容器处理。
- Proxy不但解决了同一主宿机相同服务端口冲突的问题,还提供了Service转发服务端口对外提供服务的能力,Proxy后端使用了随机、轮询等负载均衡算法。
2.2.3 kubectl[集群管理命令行工具集]
- 通过客户端的kubectl命令集操作,API Server响应对应的命令结果,从而达到对kubernetes集群的管理。
参考文章:
https://yq.aliyun.com/articles/47308?spm=5176.100240.searchblog.19.jF7FFa
3.1.2 - 基于Docker及Kubernetes技术构建容器云(PaaS)平台
[编者的话]
目前很多的容器云平台通过Docker及Kubernetes等技术提供应用运行平台,从而实现运维自动化,快速部署应用、弹性伸缩和动态调整应用环境资源,提高研发运营效率。
从宏观到微观(从抽象到具体)的思路来理解:云计算→PaaS→ App Engine→XAE[XXX App Engine] (XAE泛指一类应用运行平台,例如GAE、SAE、BAE等)。
本文简要介绍了与容器云相关的几个重要概念:PaaS、App Engine、Docker、Kubernetes。
1. PaaS概述
1.1. PaaS概念
- PaaS(Platform as a service),平台即服务,指将软件研发的平台(或业务基础平台)作为一种服务,以SaaS的模式提交给用户。
- PaaS是云计算服务的其中一种模式,云计算是一种按使用量付费的模式的服务,类似一种租赁服务,服务可以是基础设施计算资源(IaaS),平台(PaaS),软件(SaaS)。租用IT资源的方式来实现业务需要,如同水力、电力资源一样,计算、存储、网络将成为企业IT运行的一种被使用的资源,无需自己建设,可按需获得。
- PaaS的实质是将互联网的资源服务化为可编程接口,为第三方开发者提供有商业价值的资源和服务平台。简而言之,IaaS就是卖硬件及计算资源,PaaS就是卖开发、运行环境,SaaS就是卖软件。
1.2. IaaS/PaaS/SaaS说明
| 类型 | 说明 | 比喻 | 例子 |
|---|---|---|---|
| IaaS:Infrastructure-as-a-Service(基础设施即服务) | 提供的服务是计算基础设施 | 地皮,需要自己盖房子 | Amazon EC2(亚马逊弹性云计算) |
| PaaS: Platform-as-a-Service(平台即服务) | 提供的服务是软件研发的平台或业务基础平台 | 商品房,需要自己装修 | GAE(谷歌开发者平台) |
| SaaS: Software-as-a-Service(软件即服务) | 提供的服务是运行在云计算基础设施上的应用程序 | 酒店套房,可以直接入住 | 谷歌的Gmail邮箱 |
1.3. PaaS的特点(三种层次)
| 特点 | 说明 |
|---|---|
| 平台即服务 | PaaS提供的服务就是个基础平台,一个环境,而不是具体的应用 |
| 平台及服务 | 不仅提供平台,还提供对该平台的技术支持、优化等服务 |
| 平台级服务 | “平台级服务”即强大稳定的平台和专业的技术支持团队,保障应用的稳定使用 |
2. App Engine概述
2.1. App Engine概念
App Engine是PaaS模式的一种实现方式,App Engine将应用运行所需的 IT 资源和基础设施以服务的方式提供给用户,包括了中间件服务、资源管理服务、弹性调度服务、消息服务等多种服务形式。App Engine的目标是对应用提供完整生命周期(包括设计、开发、测试和部署等阶段)的支持,从而减少了用户在购置和管理应用生命周期内所必须的软硬件以及部署应用和IT 基础设施的成本,同时简化了以上工作的复杂度。常见的App Engine有:GAE(Google App Engine),SAE(Sina App Engine),BAE(Baidu App Engine)。
App Engine利用虚拟化与自动化技术实现快速搭建部署应用运行环境和动态调整应用运行时环境资源这两个目标。一方面实现即时部署以及快速回收,降低了环境搭建时间,避免了手工配置错误,快速重复搭建环境,及时回收资源, 减少了低利用率硬件资源的空置。另一方面,根据应用运行时的需求对应用环境进行动态调整,实现了应用平台的弹性扩展和自优化,减少了非高峰时硬件资源的空置。
简而言之,App Engine主要目标是:Easy to maintain(维护), Easy to scale(扩容), Easy to build(构建)。
2.2. 架构设计
2.3. 组成模块说明
| 组成模块 | 模块说明 |
|---|---|
| App Router[流量接入层] | 接收用户请求,并转发到不同的App Runtime。 |
| App Runtime[应用运行层] | 应用运行环境,为各个应用提供基本的运行引擎,从而让app能够运行起来。 |
| Services[基础服务层] | 各个通用基础服务,主要是对主流的服务提供通用的接入,例如数据库等。 |
| Platform Control[平台控制层] | 整个平台的控制中心,实现业务调度,弹性扩容、资源审计、集群管理等相关工作。 |
| Manage System[管理界面层] | 提供友好可用的管理操作界面方便平台管理员来控制管理整个平台。 |
| Platform Support[平台支持层] | 为应用提供相关的支持,比如应用监控、问题定位、分布式日志重建、统计分析等。 |
| Log Center[日志中心] | 实时收集相关应用及系统的日志(日志收集),提供实时计算和分析平台(日志处理)。 |
| Code Center[代码中心] | 完成代码存储、部署上线相关的工作。 |
3. 容器云平台技术栈
| 功能组成部分 | 使用工具 |
|---|---|
| 应用载体 | Docker |
| 编排工具 | Kubernetes |
| 配置数据 | Etcd |
| 网络管理 | Flannel |
| 存储管理 | Ceph |
| 底层实现 | Linux内核的Namespace[资源隔离]和CGroups[资源控制] |
- Namespace[资源隔离] Namespaces机制提供一种资源隔离方案。PID,IPC,Network等系统资源不再是全局性的,而是属于某个特定的Namespace。每个namespace下的资源对于其他namespace下的资源都是透明,不可见的。
- CGroups[资源控制] CGroup(control group)是将任意进程进行分组化管理的Linux内核功能。CGroup本身是提供将进程进行分组化管理的功能和接口的基础结构,I/O或内存的分配控制等具体的资源管理功能是通过这个功能来实现的。CGroups可以限制、记录、隔离进程组所使用的物理资源(包括:CPU、memory、IO等),为容器实现虚拟化提供了基本保证。CGroups本质是内核附加在程序上的一系列钩子(hooks),通过程序运行时对资源的调度触发相应的钩子以达到资源追踪和限制的目的。
4. Docker概述
更多详情请参考:Docker整体架构图
4.1. Docker介绍
- Docker - Build, Ship, and Run Any App, Anywhere
- Docker是一种Linux容器工具集,它是为“构建(Build)、交付(Ship)和运行(Run)”分布式应用而设计的。
- Docker相当于把应用以及应用所依赖的环境完完整整地打成了一个包,这个包拿到哪里都能原生运行。因此可以在开发、测试、运维中保证环境的一致性。
- Docker的本质:Docker=LXC(Namespace+CGroups)+Docker Images,即在Linux内核的Namespace[资源隔离]和CGroups[资源控制]技术的基础上通过镜像管理机制来实现轻量化设计。
4.2. Docker的基本概念
4.2.1. 镜像
Docker 镜像就是一个只读的模板,可以把镜像理解成一个模子(模具),由模子(镜像)制作的成品(容器)都是一样的(除非在生成时加额外参数),修改成品(容器)本身并不会对模子(镜像)产生影响(除非将成品提交成一个模子),容器重启时,即由模子(镜像)重新制作成一个成品(容器),与其他由该模子制作成的成品并无区别。
例如:一个镜像可以包含一个完整的 ubuntu 操作系统环境,里面仅安装了 Apache 或用户需要的其它应用程序。镜像可以用来创建 Docker 容器。Docker 提供了一个很简单的机制来创建镜像或者更新现有的镜像,用户可以直接从其他人那里下载一个已经做好的镜像来直接使用。
4.2.2. 容器
Docker 利用容器来运行应用。容器是从镜像创建的运行实例。它可以被启动、开始、停止、删除。每个容器都是相互隔离的、保证安全的平台。可以把容器看做是一个简易版的 Linux 环境(包括root用户权限、进程空间、用户空间和网络空间等)和运行在其中的应用程序。
4.2.3. 仓库
仓库是集中存放镜像文件的场所。有时候会把仓库和仓库注册服务器(Registry)混为一谈,并不严格区分。实际上,仓库注册服务器上往往存放着多个仓库,每个仓库中又包含了多个镜像,每个镜像有不同的标签(tag)。
4.3. Docker的优势
-
容器的快速轻量
容器的启动,停止和销毁都是以秒或毫秒为单位的,并且相比传统的虚拟化技术,使用容器在CPU、内存,网络IO等资源上的性能损耗都有同样水平甚至更优的表现。
-
一次构建,到处运行
当将容器固化成镜像后,就可以非常快速地加载到任何环境中部署运行。而构建出来的镜像打包了应用运行所需的程序、依赖和运行环境, 这是一个完整可用的应用集装箱,在任何环境下都能保证环境一致性。
-
完整的生态链
容器技术并不是Docker首创,但是以往的容器实现只关注于如何运行,而Docker站在巨人的肩膀上进行整合和创新,特别是Docker镜像的设计,完美地解决了容器从构建、交付到运行,提供了完整的生态链支持。
5. Kubernetes概述
更多详情请参考:Kubernetes总架构图
5.1. Kubernetes介绍
Kubernetes是Google开源的容器集群管理系统。它构建Docker技术之上,为容器化的应用提供资源调度、部署运行、服务发现、扩容缩容等整一套功能,本质上可看作是基于容器技术的Micro-PaaS平台,即第三代PaaS的代表性项目。
5.2. Kubernetes的基本概念
5.2.1. Pod
Pod是若干个相关容器的组合,是一个逻辑概念,Pod包含的容器运行在同一个宿主机上,这些容器使用相同的网络命名空间、IP地址和端口,相互之间能通过localhost来发现和通信,共享一块存储卷空间。在Kubernetes中创建、调度和管理的最小单位是Pod。一个Pod一般只放一个业务容器和一个用于统一网络管理的网络容器。
5.2.2. Replication Controller
Replication Controller是用来控制管理Pod副本(Replica,或者称实例),Replication Controller确保任何时候Kubernetes集群中有指定数量的Pod副本在运行,如果少于指定数量的Pod副本,Replication Controller会启动新的Pod副本,反之会杀死多余的以保证数量不变。另外Replication Controller是弹性伸缩、滚动升级的实现核心。
5.2.3. Service
Service是真实应用服务的抽象,定义了Pod的逻辑集合和访问这个Pod集合的策略,Service将代理Pod对外表现为一个单一访问接口,外部不需要了解后端Pod如何运行,这给扩展或维护带来很大的好处,提供了一套简化的服务代理和发现机制。
5.2.4. Label
Label是用于区分Pod、Service、Replication Controller的Key/Value键值对,实际上Kubernetes中的任意API对象都可以通过Label进行标识。每个API对象可以有多个Label,但是每个Label的Key只能对应一个Value。Label是Service和Replication Controller运行的基础,它们都通过Label来关联Pod,相比于强绑定模型,这是一种非常好的松耦合关系。
5.2.5. Node
Kubernetes属于主从的分布式集群架构,Kubernetes Node(简称为Node,早期版本叫做Minion)运行并管理容器。Node作为Kubernetes的操作单元,将用来分配给Pod(或者说容器)进行绑定,Pod最终运行在Node上,Node可以认为是Pod的宿主机。
5.3. Kubernetes架构
3.2 - kubernetes对象
3.2.1 - Kubernetes基本概念
1. Master
集群的控制节点,负责整个集群的管理和控制,kubernetes的所有的命令基本都是发给Master,由它来负责具体的执行过程。
1.1. Master的组件
- kube-apiserver:资源增删改查的入口
- kube-controller-manager:资源对象的大总管
- kube-scheduler:负责资源调度(Pod调度)
- etcd Server:kubernetes的所有的资源对象的数据保存在etcd中。
2. Node
Node是集群的工作负载节点,默认情况kubelet会向Master注册自己,一旦Node被纳入集群管理范围,kubelet会定时向Master汇报自身的情报,包括操作系统,Docker版本,机器资源情况等。
如果Node超过指定时间不上报信息,会被Master判断为“失联”,标记为Not Ready,随后Master会触发Pod转移。
2.1. Node的组件
- kubelet:Pod的管家,与Master通信
- kube-proxy:实现kubernetes Service的通信与负载均衡机制的重要组件
- Docker:容器的创建和管理
2.2. Node相关命令
kubectl get nodes
kubectl describe node {node_name}
2.3. describe命令的Node信息
- Node基本信息:名称、标签、创建时间等
- Node当前的状态,Node启动后会进行自检工作,磁盘是否满,内存是否不足,若都正常则切换为Ready状态。
- Node的主机地址与主机名
- Node上的资源总量:CPU,内存,最大可调度Pod数量等
- Node可分配资源量:当前Node可用于分配的资源量
- 主机系统信息:主机唯一标识符UUID,Linux kernel版本号,操作系统,kubernetes版本,kubelet与kube-proxy版本
- 当前正在运行的Pod列表及概要信息
- 已分配的资源使用概要,例如资源申请的最低、最大允许使用量占系统总量的百分比
- Node相关的Event信息。
3. Pod
Pod是Kubernetes中操作的基本单元。每个Pod中有个根容器(Pause容器),Pause容器的状态代表整个容器组的状态,其他业务容器共享Pause的IP,即Pod IP,共享Pause挂载的Volume,这样简化了同个Pod中不同容器之间的网络问题和文件共享问题。
- Kubernetes集群中,同宿主机的或不同宿主机的Pod之间要求能够TCP/IP直接通信,因此采用虚拟二层网络技术来实现,例如Flannel,Openvswitch(OVS)等,这样在同个集群中,不同的宿主机的Pod IP为不同IP段的IP,集群中的所有Pod IP都是唯一的,不同Pod之间可以直接通信。
- Pod有两种类型:普通Pod和静态Pod。静态Pod即不通过K8S调度和创建,直接在某个具体的Node机器上通过具体的文件来启动。普通Pod则是由K8S创建、调度,同时数据存放在ETCD中。
- Pod IP和具体的容器端口(ContainnerPort)组成一个具体的通信地址,即Endpoint。一个Pod中可以存在多个容器,可以有多个端口,Pod IP一样,即有多个Endpoint。
- Pod Volume是定义在Pod之上,被各个容器挂载到自己的文件系统中,可以用分布式文件系统实现后端存储功能。
- Pod中的Event事件可以用来排查问题,可以通过kubectl describe pod xxx 来查看对应的事件。
- 每个Pod可以对其能使用的服务器上的计算资源设置限额,一般为CPU和Memory。K8S中一般将千分之一个的CPU配置作为最小单位,用m表示,是一个绝对值,即100m对于一个Core的机器还是48个Core的机器都是一样的大小。Memory配额也是个绝对值,单位为内存字节数。
- 资源配额的两个参数
- Requests:该资源的最小申请量,系统必须满足要求。
- Limits:该资源最大允许使用量,当超过该量,K8S会kill并重启Pod。
4. Label
- Label是一个键值对,可以附加在任何对象上,比如Node,Pod,Service,RC等。Label和资源对象是多对多的关系,即一个Label可以被添加到多个对象上,一个对象也可以定义多个Label。
- Label的作用主要用来实现精细的、多维度的资源分组管理,以便进行资源分配,调度,配置,部署等工作。
- Label通俗理解就是“标签”,通过标签来过滤筛选指定的对象,进行具体的操作。k8s通过Label Selector(标签选择器)来筛选指定Label的资源对象,类似SQL语句中的条件查询(WHERE语句)。
- Label Selector有基于等式和基于集合的两种表达方式,可以多个条件进行组合使用。
- 基于等式:name=redis-slave(匹配name=redis-slave的资源对象);env!=product(匹配所有不具有标签env=product的资源对象)
- 基于集合:name in (redis-slave,redis-master);name not in (php-frontend)(匹配所有不具有标签name=php-frontend的资源对象)
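对应的kubectl筛选命令示意如下(标签值沿用上面的例子):
# 基于等式的选择
kubectl get pods -l name=redis-slave
kubectl get pods -l env!=product
# 基于集合的选择
kubectl get pods -l 'name in (redis-slave,redis-master)'
kubectl get pods -l 'name notin (php-frontend)'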
使用场景
- kube-controller进程通过资源对象RC上定义的Label Selector来筛选要监控的Pod副本数,从而实现副本数始终保持预期数目。
- kube-proxy进程通过Service的Label Selector来选择对应Pod,自动建立每个Service到对应Pod的请求转发路由表,从而实现Service的智能负载均衡机制。
- kube-scheduler实现Pod定向调度:对Node定义特定的Label,并且在Pod定义文件中使用NodeSelector标签调度策略。
5. Replication Controller(RC)
RC是k8s系统中的核心概念,定义了一个期望的场景。
主要包括:
- Pod期望的副本数(replicas)
- 用于筛选目标Pod的Label Selector
- 用于创建Pod的模板(template)
RC特性说明:
- Pod的缩放可以通过以下命令实现:kubectl scale rc redis-slave --replicas=3
- 删除RC并不会删除该RC创建的Pod,可以将副本数设置为0,即可删除对应Pod。或者通过kubectl stop /delete命令来一次性删除RC和其创建的Pod。
- 改变RC中Pod模板的镜像版本可以实现滚动升级(Rolling Update)。具体操作见https://kubernetes.io/docs/tasks/run-application/rolling-update-replication-controller/
- Kubernetes1.2以上版本将RC升级为Replica Set,它与当前RC的唯一区别在于Replica Set支持基于集合的Label Selector(Set-based selector),而旧版本RC只支持基于等式的Label Selector(equality-based selector)。
- Kubernetes1.2以上版本通过Deployment来维护Replica Set而不是单独使用Replica Set,即控制流为:Deployment→Replica Set→Pod。即新版本的Deployment+Replica Set替代了RC的作用。
6. Deployment
Deployment是kubernetes 1.2引入的概念,用来解决Pod的编排问题。Deployment可以理解为RC的升级版(RC+Replica Set)。特点在于可以随时知道Pod的部署进度,即对Pod的创建、调度、绑定节点、启动容器完整过程的进度展示。
使用场景
- 创建一个Deployment对象来生成对应的Replica Set并完成Pod副本的创建过程。
- 检查Deployment的状态来确认部署动作是否完成(Pod副本的数量是否达到预期值)。
- 更新Deployment以创建新的Pod(例如镜像升级的场景)。
- 如果当前Deployment不稳定,回退到上一个Deployment版本。
- 挂起或恢复一个Deployment。
可以通过kubectl describe deployment来查看Deployment控制的Pod的水平拓展过程。
7. Horizontal Pod Autoscaler(HPA)
Horizontal Pod Autoscaler(HPA)即Pod横向自动扩容,与RC一样也属于k8s的资源对象。
HPA原理:通过追踪分析RC控制的所有目标Pod的负载变化情况,来确定是否针对性调整Pod的副本数。
Pod负载度量指标:
- CPUUtilizationPercentage:Pod所有副本自身的CPU利用率的平均值。即当前Pod的CPU使用量除以Pod Request的值。
- 应用自定义的度量指标,比如服务每秒内响应的请求数(TPS/QPS)。
8. Service(服务)
8.1. Service概述
Service定义了一个服务的访问入口地址,前端应用通过这个入口地址访问其背后的一组由Pod副本组成的集群实例,Service与其后端的Pod副本集群之间是通过Label Selector来实现“无缝对接”。RC保证Service的Pod副本实例数目保持预期水平。
8.2. kubernetes的服务发现机制
主要通过kube-dns这个组件来进行DNS方式的服务发现。
8.3. 外部系统访问Service的问题
| IP类型 | 说明 |
|---|---|
| Node IP | Node节点的IP地址 |
| Pod IP | Pod的IP地址 |
| Cluster IP | Service的IP地址 |
8.3.1. Node IP
NodeIP是集群中每个节点的物理网卡IP地址,是真实存在的物理网络,kubernetes集群之外的节点访问kubernetes内的某个节点或TCP/IP服务的时候,需要通过NodeIP进行通信。
8.3.2. Pod IP
Pod IP是每个Pod的IP地址,是Docker Engine根据docker0网桥的IP段地址进行分配的,是一个虚拟二层网络,集群中一个Pod的容器访问另一个Pod中的容器,是通过Pod IP进行通信的,而真实的TCP/IP流量是通过Node IP所在的网卡流出的。
8.3.3. Cluster IP
- Service的Cluster IP是一个虚拟IP,只作用于Service这个对象,由kubernetes管理和分配IP地址(来源于Cluster IP地址池)。
- Cluster IP无法被ping通,因为没有一个实体网络对象来响应。
- Cluster IP结合Service Port组成的具体通信端口才具备TCP/IP通信基础,属于kubernetes集群内,集群外访问该IP和端口需要额外处理。
- k8s集群内Node IP 、Pod IP、Cluster IP之间的通信采取k8s自己的特殊的路由规则,与传统IP路由不同。
8.3.4. 外部访问Kubernetes集群
通过宿主机与容器端口映射的方式进行访问,例如使用 NodePort 类型的 Service 将服务端口映射到每个 Node 上。
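一个 NodePort 类型 Service 的示意定义如下(名称与端口均为假设值):
cat << EOF > my-nodeport-svc.yaml
apiVersion: v1
kind: Service
metadata:
  name: my-nodeport-svc
spec:
  type: NodePort
  selector:
    app: my-app
  ports:
  - port: 80          # Service端口
    targetPort: 8080  # 容器端口
    nodePort: 30080   # Node上暴露的端口,假设值
EOF
kubectl create -f my-nodeport-svc.yaml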
可以通过任意Node的IP 加端口访问该服务。也可以通过Nginx或HAProxy来设置负载均衡。
9. Volume(存储卷)
9.1. Volume的功能
- Volume是Pod中能够被多个容器访问的共享目录,可以让容器的数据写到宿主机上或者写文件到网络存储中
- 可以实现容器配置文件集中化定义与管理,通过ConfigMap资源对象来实现。
9.2. Volume的特点
k8s中的Volume与Docker的Volume相似,但不完全相同。
- k8s上Volume定义在Pod上,然后被一个Pod中的多个容器挂载到具体的文件目录下。
- k8s的Volume与Pod生命周期相关而不是容器的生命周期,即容器挂掉,数据不会丢失;但Pod挂掉,数据则会丢失。
- k8s中的Volume支持多种类型的Volume:Ceph、GlusterFS等分布式系统。
9.3. Volume的使用方式
先在Pod上声明一个Volume,然后容器引用该Volume并Mount到容器的某个目录。
9.4. Volume类型
9.4.1. emptyDir
emptyDir Volume是在Pod分配到Node时创建的,初始内容为空,无须指定宿主机上对应的目录文件,由K8S自动分配一个目录,当Pod被删除时,对应的emptyDir数据也会永久删除。
作用:
- 临时空间,例如程序的临时文件,无须永久保留
- 长时间任务的中间过程CheckPoint的临时保存目录
- 一个容器需要从另一个容器中获取数据的目录(即多容器共享目录)
说明:
目前用户无法设置emptyDir的使用介质,如果kubelet的配置使用硬盘,则emptyDir将创建在该硬盘上。
9.4.2. hostPath
hostPath是在Pod上挂载宿主机上的文件或目录。
作用:
- 容器应用日志需要持久化时,可以使用宿主机的高速文件系统进行存储
- 需要访问宿主机上Docker引擎内部数据结构的容器应用时,可以通过定义hostPath为宿主机/var/lib/docker目录,使容器内部应用可以直接访问Docker的文件系统。
注意点:
- 在不同的Node上具有相同配置的Pod可能会因为宿主机上的目录或文件不同导致对Volume上目录或文件的访问结果不一致。
- 如果使用了资源配额管理,则kubernetes无法将hostPath在宿主机上使用的资源纳入管理。
9.4.3. gcePersistentDisk
表示使用谷歌公有云提供的永久磁盘(Persistent Disk,PD)存放Volume的数据,它与EmptyDir不同,PD上的内容会被永久保存。当Pod被删除时,PD只是被卸载,但不会被删除。需要先创建一个永久磁盘,才能使用gcePersistentDisk。
使用gcePersistentDisk的限制条件:
- Node(运行kubelet的节点)需要是GCE虚拟机。
- 虚拟机需要与PD存在于相同的GCE项目中和Zone中。
10. Persistent Volume
Volume定义在Pod上,属于“计算资源”的一部分,而Persistent Volume和Persistent Volume Claim是网络存储,简称PV和PVC,可以理解为k8s集群中某个网络存储中对应的一块存储。
- PV是网络存储,不属于任何Node,但可以在每个Node上访问。
- PV不是定义在Pod上,而是独立于Pod之外定义。
- PV常见类型:GCE Persistent Disks、NFS、RBD等。
PV是有状态的对象,状态类型如下:
- Available:空闲状态
- Bound:已经绑定到某个PVC上
- Released:对应的PVC已经删除,但资源还没有回收
- Failed:PV自动回收失败
11. Namespace
Namespace即命名空间,主要用于多租户的资源隔离,通过将资源对象分配到不同的Namespace上,便于不同的分组在共享资源的同时可以被分别管理。
k8s集群启动后会默认创建一个“default”的Namespace。可以通过kubectl get namespaces查看。
可以通过kubectl config use-context namespace配置当前k8s客户端的环境,通过kubectl get pods获取当前namespace的Pod。或者通过kubectl get pods --namespace=NAMESPACE来获取指定namespace的Pod。
Namespace yaml文件的定义
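一个简单的Namespace定义及创建方式示意如下(名称development为假设值):
cat << EOF > namespace-dev.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: development
EOF
kubectl create -f namespace-dev.yaml
kubectl get pods --namespace=development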
12. Annotation(注解)
Annotation与Label类似,也使用key/value的形式进行定义,Label定义元数据(Metadata),Annotation定义“附加”信息。
通常Annotation记录信息如下:
- build信息,release信息,Docker镜像信息等。
- 日志库、监控库等。
参考《Kubernetes权威指南》
3.2.2 - 理解kubernetes对象
1. kubernetes对象概述
kubernetes中的对象是一些持久化的实体,可以理解为是对集群状态的描述或期望。
包括:
- 集群中哪些node上运行了哪些容器化应用
- 应用的资源是否满足使用
- 应用的执行策略,例如重启策略、更新策略、容错策略等。
kubernetes的对象是一种意图(期望)的记录,kubernetes会始终保持预期创建的对象存在和集群运行在预期的状态下。
操作kubernetes对象(增删改查)需要通过kubernetes API,一般有以下几种方式:
- kubectl命令工具
- Client library的方式,例如 client-go
2. Spec and Status
每个kubernetes对象的结构描述都包含spec和status两个部分。
- spec:该内容由用户提供,描述用户期望的对象特征及集群状态。
- status:该内容由kubernetes集群提供和更新,描述kubernetes对象的实时状态。
任何时候,kubernetes都会控制集群的实时状态status与用户的预期状态spec一致。
例如:当你定义Deployment的描述文件,指定集群中运行3个实例,那么kubernetes会始终保持集群中运行3个实例,如果任何实例挂掉,kubernetes会自动重建新的实例来保持集群中始终运行用户预期的3个实例。
3. 对象描述文件
当你要创建一个kubernetes对象的时候,需要提供该对象的描述信息spec,来描述你的对象在kubernetes中的预期状态。
一般使用kubernetes API来创建kubernetes对象,其中spec信息可以以JSON的形式存放在request body中,也可以以.yaml文件的形式通过kubectl工具创建。
例如,以下为Deployment对象对应的yaml文件:
apiVersion: apps/v1beta2 # for versions before 1.8.0 use apps/v1beta1
kind: Deployment
metadata:
name: nginx-deployment
spec:
replicas: 3
selector:
matchLabels:
app: nginx
template:
metadata:
labels:
app: nginx
spec:
containers:
- name: nginx
image: nginx:1.7.9
ports:
- containerPort: 80
执行kubectl create的命令
#create command
kubectl create -f https://k8s.io/docs/user-guide/nginx-deployment.yaml --record
#output
deployment "nginx-deployment" created
4. 必须字段
在对象描述文件.yaml中,必须包含以下字段。
- apiVersion:kubernetes API的版本
- kind:kubernetes对象的类型
- metadata:唯一标识该对象的元数据,包括name、UID、可选的namespace
- spec:标识对象的详细信息,不同对象的spec的格式不同,可以嵌套其他对象的字段。
文章参考:
https://kubernetes.io/docs/concepts/overview/working-with-objects/kubernetes-objects/
3.3 - Pod对象
3.3.1 - Pod介绍
1. Pod是什么(what)
1.1. Pod概念
- Pod是kubernetes集群中最小的部署和管理的基本单元,协同寻址,协同调度。
- Pod是一个或多个容器的集合,是一个或一组服务(进程)的抽象集合。
- Pod中可以共享网络和存储(可以简单理解为一个逻辑上的虚拟机,但并不是虚拟机)。
- Pod被创建后用一个UID来唯一标识,当Pod生命周期结束,被一个等价Pod替代,UID将重新生成。
1.1.1. Pod与Docker
- Docker是目前Pod最常用的容器环境,但仍支持其他容器环境。
- Pod是一组被模块化的拥有共享命名空间和共享存储卷的容器,但并没有共享PID 命名空间(即同个Pod的不同容器中进程的PID是独立的,互相看不到非自己容器的进程)。
1.1.2. Pod中容器的运行方式
- 只运行一个单独的容器
即one-container-per-Pod模式,是最常用的模式,可以把这样的Pod看成单独的一个容器去管理。
- 运行多个强关联的容器
即sidecar模式,Pod 封装了一组紧耦合、共享资源、协同寻址的容器,将这组容器作为一个管理单元。
1.2. Pod管理多个容器
Pod是一组紧耦合的容器的集合,Pod内的容器作为一个整体以Pod形式进行协同寻址,协同调度、协同管理。相同Pod内的容器共享网络和存储。
1.2.1. 网络
- 每个Pod被分配了唯一的IP地址,该Pod内的所有容器共享一个网络空间,包括IP和端口。
- 同个Pod不同容器之间通过localhost通信,Pod内端口不能冲突。
- 不同Pod之间的通信则通过IP+端口的形式来访问到Pod内的具体服务(容器)。
1.2.2. 存储
- 可以在Pod中以创建共享存储卷的方式来实现不同容器之间的数据共享。
2. 为什么需要Pod(why)
2.1. 管理需求
Pod 是一种模式的抽象:互相协作的多个进程(容器)共同形成一个完整的服务。以一个或多个容器的方式组合成一个整体,作为管理的基本单元,通过Pod可以方便部署、水平扩展,协同调度等。
2.2. 资源共享和通信
Pod作为多个紧耦合的容器的集合,通过共享网络和存储的方式来简化紧耦合容器之间的通信,从这个角度,可以将Pod简单理解为一个逻辑上的“虚拟机”。而不同的Pod之间的通信则通过Pod的IP和端口的方式。
2.3. Pod设计的优势
- 调度器和控制器的可插拔性。
- 将Pod 的生存期从 controller 中剥离出来,从而减少相互影响。
- 高可用--在终止和删除 Pod 前,需要提前生成替代 Pod。
- 集群级别的功能和 Kubelet(Pod Controller) 级别的功能组合更加清晰。
3. Pod的使用(how)
Pod一般是通过各种不同类型的Controller对Pod进行管理和控制,包括自我恢复(例如Pod因异常退出,则会再起一个相同的Pod替代该Pod,而该Pod则会被清除)。也可以不通过Controller单独创建一个Pod,但一般很少这么操作,因为这个Pod是一个孤立的实体,并不会被Controller管理。
3.1. Controller
Controller是kubernetes中用于对Pod进行管理的控制器,通过该控制器让Pod始终维持在一个用户原本设定或期望的状态。如果节点宕机或者Pod因其他原因死亡,则会在其他节点起一个相同的Pod来替代该Pod。
常用的Controller有:
- Deployment
- StatefulSet
- DaemonSet
Controller是通过用户提供的Pod模板来创建和控制Pod。
3.2. Pod模板
Pod模板用来定义Pod的各种属性,Controller通过Pod模板来生成对应的Pod。
Pod模板类似一个饼干模具,通过模具已经生成的饼干与原模具已经没有关系,即对原模具的修改不会影响已经生成的饼干,只会对通过修改后的模具生成的饼干有影响。这种方式可以更加方便地控制和管理Pod。
4. Pod的终止
用户发起一个删除Pod的请求,系统会先发送TERM信号给每个容器的主进程,如果在宽限期(默认30秒)主进程没有自主终止运行,则系统会发送KILL信号给该进程,接着Pod将被删除。
4.1. Pod终止的流程
- 用户发送一个删除 Pod 的命令,并使用默认的宽限期(30s)。
- API Server 中的 Pod 对象会被更新一个截止时间,超过宽限期后 Pod 即被视为 “dead”。
- 使用客户端命令查询时,显示出的 Pod 状态为 terminating。
- (与第3步同时发生)kubelet 发现 Pod 因超过第2步设置的时间而被标记为 terminating 状态时,将启动停止流程:
  - 如果 Pod 定义了 preStop hook,则会在 Pod 内部调用它;如果宽限期已经过期但 preStop hook 仍在运行,则会在原宽限期上增加一个小的时间窗口(2 秒钟)。
  - 向 Pod 里的进程发送 TERM 信号。
- (与第3步同时发生)Pod 被从 Service 的端点(endpoints)列表里移除,同时也不再被 replication controller 视为运行中的 Pod。在负载均衡(比如 service proxy)将其从轮询中移除之前,这种慢关闭的方式可以让 Pod 继续为存量流量提供服务。
- 当宽限期过期时,任何还在 Pod 里运行的进程都会被 SIGKILL 杀掉。
- kubelet 通过在 API Server 上把宽限期设置成 0(立刻删除)的方式完成删除 Pod 的过程。这时 Pod 在 API 里消失,也不再能被用户看到。
4.2. 强制删除Pod
强制删除Pod是指从k8s集群状态和Etcd中立刻删除对应的Pod数据,API Server不会等待kubelet的确认信息。被强制删除后,即可重新创建一个相同名字的Pod。
删除默认的宽限期是30秒,通过将宽限期设置为0的方式可以强制删除Pod。
通过kubectl delete 命令后加--force和--grace-period=0的参数强制删除Pod。
kubectl delete pod <pod_name> --namespace=<namespace> --force --grace-period=0
4.3. Pod特权模式
特权模式是指让Pod中的进程具有访问宿主机系统设备或使用网络栈操作等的能力,例如编写网络插件和卷插件。
通过将container spec中的SecurityContext设置为privileged即将该容器赋予了特权模式。特权模式的使用要求k8s版本高于v1.1。
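将容器设置为特权模式的示意片段如下(Pod名称与镜像均为假设值):
cat << EOF > privileged-demo.yaml
apiVersion: v1
kind: Pod
metadata:
  name: privileged-demo
spec:
  containers:
  - name: demo
    image: busybox
    command: ["sleep", "3600"]
    securityContext:
      privileged: true    # 将该容器设置为特权模式
EOF
kubectl create -f privileged-demo.yaml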
参考文章:
3.3.2 - Pod定义文件
1. Pod的基本用法
1.1. 说明
- Pod实际上是容器的集合,在k8s中对运行容器的要求为:容器的主程序需要一直在前台运行,而不是后台运行。应用可以改造成前台运行的方式,例如Go语言的程序,直接运行二进制文件;java语言则运行主类;tomcat程序可以写个运行脚本。或者通过supervisor的进程管理工具,即supervisor在前台运行,应用程序由supervisor管理在后台运行。具体可参考supervisord。
- 当多个应用之间是紧耦合的关系时,可以将多个应用一起放在一个Pod中,同个Pod中的多个容器之间互相访问可以通过localhost来通信(可以把Pod理解成一个虚拟机,共享网络和存储卷)。
1.2. Pod相关命令
| 操作 | 命令 | 说明 |
|---|---|---|
| 创建 | kubectl create -f frontend-localredis-pod.yaml | |
| 查询Pod运行状态 | kubectl get pods --namespace=<NAMESPACE> | |
| 查询Pod详情 | kubectl describe pod <POD_NAME> --namespace=<NAMESPACE> | 该命令常用来排查问题,查看Event事件 |
| 删除 | kubectl delete pod <POD_NAME>;kubectl delete pod --all | |
| 更新 | kubectl replace pod.yaml | - |
2. Pod的定义文件
apiVersion: v1
kind: Pod
metadata:
name: string
namespace: string
labels:
- name: string
annotations:
- name: string
spec:
containers:
- name: string
image: string
imagePullPolicy: [Always | Never | IfNotPresent]
command: [string]
args: [string]
workingDir: string
volumeMounts:
- name: string
mountPath: string
readOnly: boolean
ports:
- name: string
containerPort: int
hostPort: int
protocol: string
env:
- name: string
value: string
resources:
limits:
cpu: string
memory: string
requests:
cpu: string
memory: string
livenessProbe:
exec:
command: [string]
httpGet:
path: string
port: int
host: string
scheme: string
httpHeaders:
- name: string
value: string
tcpSocket:
port: int
initialDelaySeconds: number
timeoutSeconds: number
periodSeconds: number
successThreshold: 0
failureThreshold: 0
securityContext:
privileged: false
restartPolicy: [Always | Never | OnFailure]
nodeSelector: object
imagePullSecrets:
- name: string
hostNetwork: false
volumes:
- name: string
emptyDir: {}
hostPath:
path: string
secret:
secretName: string
items:
- key: string
path: string
configMap:
name: string
items:
- key: string
path: string
3. 静态pod
静态Pod是由kubelet进行管理,仅存在于特定Node上的Pod。它们不能通过API Server进行管理,无法与ReplicationController、Deployment或DaemonSet进行关联,并且kubelet也无法对其健康检查。
静态Pod总是由kubelet创建,并且总在kubelet所在的Node上运行。
创建静态Pod的方式:
3.1. 通过配置文件方式
需要设置kubelet的启动参数“--config”,指定kubelet需要监控的配置文件所在目录,kubelet会定期扫描该目录,并根据该目录的.yaml或.json文件进行创建操作。静态Pod无法通过API Server删除(若删除会变成pending状态),如需删除该Pod则将yaml或json文件从这个目录中删除。
例如:
配置目录为/etc/kubelet.d/,配置启动参数:--config=/etc/kubelet.d/,该目录下放入static-web.yaml。
apiVersion: v1
kind: Pod
metadata:
name: static-web
labels:
name: static-web
spec:
containers:
- name: static-web
image: nginx
ports:
- name: web
containerPort: 80
参考文章
- 《Kubernetes权威指南》
3.3.3 - Pod生命周期
1. Pod phase
Pod的phase是Pod生命周期中的简单宏观描述,定义在Pod的PodStatus对象的phase 字段中。
phase有以下几种值:
| 状态值 | 说明 |
|---|---|
| 挂起(Pending) | Pod 已被 Kubernetes 系统接受,但有一个或者多个容器镜像尚未创建。等待时间包括调度 Pod 的时间和通过网络下载镜像的时间。 |
| 运行中(Running) | 该 Pod 已经绑定到了一个节点上,Pod 中所有的容器都已被创建。至少有一个容器正在运行,或者正处于启动或重启状态。 |
| 成功(Succeeded) | Pod 中的所有容器都被成功终止,并且不会再重启。 |
| 失败(Failed) | Pod 中的所有容器都已终止了,并且至少有一个容器是因为失败终止。也就是说,容器以非0状态退出或者被系统终止。 |
| 未知(Unknown) | 因为某些原因无法取得 Pod 的状态,通常是因为与 Pod 所在主机通信失败。 |
2. Pod 状态
Pod 有一个 PodStatus 对象,其中包含一个 PodCondition 数组。PodCondition 包含以下字段:
- lastProbeTime:Pod condition最后一次被探测到的时间戳。
- lastTransitionTime:Pod最后一次状态转变的时间戳。
- message:状态转化的信息,一般为报错信息,例如:containers with unready status: [c-1]。
- reason:最后一次状态形成的原因,一般为报错原因,例如:ContainersNotReady。
- status:包含的值有 True、False 和 Unknown。
- type:Pod状态的几种类型。
其中type字段包含以下几个值:
- PodScheduled:Pod已经被调度到运行节点。
- Ready:Pod已经可以接收请求提供服务。
- Initialized:所有的init container已经成功启动。
- Unschedulable:无法调度该Pod,例如节点资源不够。
- ContainersReady:Pod中的所有容器已准备就绪。
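可以通过如下方式查看某个Pod的conditions(Pod名称my-pod为假设值):
# 查看describe输出中的Conditions部分
kubectl describe pod my-pod | grep -A 8 "Conditions:"
# 或者直接输出status.conditions字段
kubectl get pod my-pod -o jsonpath='{.status.conditions}'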
3. 重启策略
Pod通过restartPolicy字段指定重启策略,重启策略类型为:Always、OnFailure 和 Never,默认为 Always。
restartPolicy 仅指通过同一节点上的 kubelet 重新启动容器。
| 重启策略 | 说明 |
|---|---|
| Always | 当容器失效时,由kubelet自动重启该容器 |
| OnFailure | 当容器终止运行且退出码不为0时,由kubelet自动重启该容器 |
| Never | 不论容器运行状态如何,kubelet都不会重启该容器 |
说明:
可以管理Pod的控制器有Replication Controller,Job,DaemonSet,及kubelet(静态Pod)。
- RC和DaemonSet:必须设置为Always,需要保证该容器持续运行。
- Job:OnFailure或Never,确保容器执行完后不再重启。
- kubelet:在Pod失效的时候重启它,不论RestartPolicy设置为什么值,并且不会对Pod进行健康检查。
4. Pod的生命
Pod的生命周期一般通过Controller的方式管理,每种Controller都会包含PodTemplate来指明Pod的相关属性,Controller可以自动对pod的异常状态进行重新调度和恢复,除非通过Controller的方式删除其管理的Pod,不然kubernetes始终运行用户预期状态的Pod。
控制器的分类
- 使用
Job运行预期会终止的 Pod,例如批量计算。Job 仅适用于重启策略为OnFailure或Never的 Pod。 - 对预期不会终止的 Pod 使用
ReplicationController、ReplicaSet和Deployment,例如 Web 服务器。 ReplicationController 仅适用于具有restartPolicy为 Always 的 Pod。 - 提供特定于机器的系统服务,使用
DaemonSet为每台机器运行一个 Pod 。
如果节点死亡或与集群的其余部分断开连接,则 Kubernetes 将应用一个策略将丢失节点上的所有 Pod 的 phase 设置为 Failed。
5. Pod状态转换
常见的状态转换
| Pod的容器数 | Pod当前状态 | 发生的事件 | Pod结果状态 | ||
|---|---|---|---|---|---|
| RestartPolicy=Always | RestartPolicy=OnFailure | RestartPolicy=Never | |||
| 包含一个容器 | Running | 容器成功退出 | Running | Succeeded | Succeeded |
| 包含一个容器 | Running | 容器失败退出 | Running | Running | Failure |
| 包含两个容器 | Running | 1个容器失败退出 | Running | Running | Running |
| 包含两个容器 | Running | 容器被OOM杀掉 | Running | Running | Failure |
5.1. 容器运行时内存超出限制
- 容器以失败状态终止。
- 记录 OOM 事件。
- 如果
restartPolicy为:- Always:重启容器;Pod
phase仍为 Running。 - OnFailure:重启容器;Pod
phase仍为 Running。 - Never: 记录失败事件;Pod
phase仍为 Failed。
- Always:重启容器;Pod
5.2. 磁盘故障
- 杀掉所有容器。
- 记录适当事件。
- Pod
phase变成 Failed。 - 如果使用控制器来运行,Pod 将在别处重建。
5.3. 运行节点挂掉
- 节点控制器等待直到超时。
- 节点控制器将 Pod
phase设置为 Failed。 - 如果是用控制器来运行,Pod 将在别处重建。
参考文章:
3.3.4 - Pod健康检查
Pod健康检查
Pod的健康状态由两类探针来检查:LivenessProbe和ReadinessProbe。
1. 探针类型
1. livenessProbe(存活探针)
- 表明容器是否正在运行。
- 如果存活探测失败,则 kubelet 会杀死容器,并且容器将受到其
重启策略的影响。 - 如果容器不提供存活探针,则默认状态为
Success。
2. readinessProbe(就绪探针)
- 表明容器是否可以正常接受请求。
- 如果就绪探测失败,端点控制器将从与 Pod 匹配的所有 Service 的端点中删除该 Pod 的 IP 地址。
- 初始延迟之前的就绪状态默认为
Failure。 - 如果容器不提供就绪探针,则默认状态为
Success。
2. Handler
探针是kubelet对容器执行定期的诊断,主要通过调用容器配置的三类Handler实现:
Handler的类型:
- ExecAction:在容器内执行指定命令。如果命令退出时返回码为 0 则认为诊断成功。
- TCPSocketAction:对指定端口上的容器的 IP 地址进行 TCP 检查。如果端口打开,则诊断被认为是成功的。
- HTTPGetAction:对指定的端口和路径上的容器的 IP 地址执行 HTTP Get 请求。如果响应的状态码大于等于200 且小于 400,则诊断被认为是成功的。
探测结果为以下三种之一:
- 成功:容器通过了诊断。
- 失败:容器未通过诊断。
- 未知:诊断失败,因此不会采取任何行动。
3. 探针使用方式
- 如果容器在异常时能够自行崩溃退出,则不一定要使用探针,可以由Pod的restartPolicy执行重启操作。
- 存活探针适用于希望容器探测失败后被杀死并重新启动的场景,需要指定restartPolicy为 Always 或 OnFailure。
- 就绪探针适用于希望Pod在不能正常接收流量的时候被剔除,并且在就绪探针探测成功后才接收流量的场景。
存活探针由 kubelet 来执行,因此所有的请求都在 kubelet 的网络命名空间中进行。
3.1. LivenessProbe参数
- initialDelaySeconds:启动容器后首次进行健康检查的等待时间,单位为秒。
- timeoutSeconds:健康检查发送请求后等待响应的时间,如果超时响应kubelet则认为容器非健康,重启该容器,单位为秒。
3.2. LivenessProbe三种实现方式
1)ExecAction:在一个容器内部执行一个命令,如果该命令状态返回值为0,则表明容器健康。
apiVersion: v1
kind: Pod
metadata:
name: liveness-exec
spec:
containers:
- name: liveness
image: gcr.io/google_containers/busybox
args:
- /bin/sh
- -c
- echo ok > /tmp/health;sleep 10;rm -fr /tmp/health;sleep 600
livenessProbe:
exec:
command:
- cat
- /tmp/health
initialDelaySeconds: 15
timeoutSeconds: 1
2)TCPSocketAction:通过容器IP地址和端口号执行TCP检查,如果能够建立TCP连接,则表明容器健康。
apiVersion: v1
kind: Pod
metadata:
name: pod-with-healthcheck
spec:
containers:
- name: nginx
image: nginx
ports:
- containerPort: 80
livenessProbe:
tcpSocket:
port: 80
initialDelaySeconds: 15
timeoutSeconds: 1
3)HTTPGetAction:通过容器的IP地址、端口号及路径调用HTTP Get方法,如果响应的状态码大于等于200且小于400,则认为容器健康。
apiVersion: v1
kind: Pod
metadata:
name: pod-with-healthcheck
spec:
containers:
- name: nginx
image: nginx
ports:
- containerPort: 80
livenessProbe:
httpGet:
path: /_status/healthz
port: 80
initialDelaySeconds: 15
timeoutSeconds: 1
参考文章:
- https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/
- 《Kubernetes权威指南》
3.3.5 - Pod存储卷
Pod Volume
同一个Pod中的多个容器可以共享Pod级别的存储卷Volume,Volume可以定义为各种类型,多个容器各自进行挂载,将Pod的Volume挂载为容器内部需要的目录。
例如:Pod级别的Volume:"app-logs",用于tomcat向其中写日志文件,busybox读日志文件。
pod-volumes-applogs.yaml
apiVersion: v1
kind: Pod
metadata:
name: volume-pod
spec:
containers:
- name: tomcat
image: tomcat
ports:
- containerPort: 8080
volumeMounts:
- name: app-logs
mountPath: /usr/local/tomcat/logs
- name: busybox
image: busybox
command: ["sh","-c","tailf /logs/catalina*.log"]
volumeMounts:
- name: app-logs
mountPath: /logs
volumes:
- name: app-logs
emptyDir: {}
查看日志
- kubectl logs <pod_name> -c <container_name>
- kubectl exec -it <pod_name> -c <container_name> -- tail /usr/local/tomcat/logs/catalina.xx.log
参考文章
- 《Kubernetes权威指南》
3.3.6 - Pod调度
Pod调度
在kubernetes集群中,Pod(container)是应用的载体,一般通过RC、Deployment、DaemonSet、Job等对象来完成Pod的调度与自愈功能。
1. RC、Deployment:全自动调度
RC的功能即保持集群中始终运行着指定个数的Pod。
在调度策略上主要有:
- 系统内置调度算法[最优Node]
- NodeSelector[定向调度]
- NodeAffinity[亲和性调度]
2. NodeSelector[定向调度]
k8s中kube-scheduler负责实现Pod的调度,内部系统通过一系列算法最终计算出最佳的目标节点。如果需要将Pod调度到指定Node上,则可以通过Node的标签(Label)和Pod的nodeSelector属性相匹配来达到目的。
1、kubectl label nodes {node-name} {label-key}={label-value}
2、nodeSelector: {label-key}:{label-value}
如果给多个Node打了相同的标签,则scheduler会根据调度算法从这组Node中选择一个可用的Node来调度。
如果Pod的nodeSelector的标签在Node中没有对应的标签,则该Pod无法被调度成功。
Node标签的使用场景:
对集群中不同类型的Node打上不同的标签,可控制应用运行Node的范围。例如role=frontend;role=backend;role=database。
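下面是一个定向调度的示意操作(节点名node-1、标签role=backend及Pod定义均为假设的例子):
# 1、给Node打标签
kubectl label nodes node-1 role=backend
# 2、在Pod定义中通过nodeSelector匹配该标签
cat << EOF > nginx-on-backend.yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx-on-backend
spec:
  nodeSelector:
    role: backend
  containers:
  - name: nginx
    image: nginx
EOF
kubectl create -f nginx-on-backend.yaml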
3. NodeAffinity[亲和性调度]
NodeAffinity意为Node亲和性调度策略,NodeSelector为精确匹配,NodeAffinity为条件范围匹配,通过In(属于)、NotIn(不属于)、Exists(存在一个条件)、DoesNotExist(不存在)、Gt(大于)、Lt(小于)等操作符来选择Node,使调度更加灵活。
- RequiredDuringSchedulingRequiredDuringExecution:类似于NodeSelector,但在Node不满足条件时,系统将从该Node上移除之前调度上的Pod。
- RequiredDuringSchedulingIgnoredDuringExecution:与上一个类似,区别是在Node不满足条件时,系统不一定从该Node上移除之前调度上的Pod。
- PreferredDuringSchedulingIgnoredDuringExecution:指定在满足调度条件的Node中,哪些Node应更优先地进行调度。同时在Node不满足条件时,系统不一定从该Node上移除之前调度上的Pod。
如果同时设置了NodeSelector和NodeAffinity,则系统将需要同时满足两者的设置才能进行调度。
4. DaemonSet:特定场景调度
DaemonSet是kubernetes1.2版本新增的一种资源对象,用于管理在集群中每个Node上仅运行一份Pod的副本实例。
该用法适用的应用场景:
- 在每个Node上运行一个GlusterFS存储或者Ceph存储的daemon进程。
- 在每个Node上运行一个日志采集程序:fluentd或logstach。
- 在每个Node上运行一个健康程序,采集该Node的运行性能数据,例如:Prometheus Node Exportor、collectd、New Relic agent或Ganglia gmond等。
DaemonSet的Pod调度策略与RC类似,除了使用系统内置算法在每台Node上进行调度,也可以通过NodeSelector或NodeAffinity来指定满足条件的Node范围进行调度。
5. Job:批处理调度
kubernetes从1.2版本开始支持批处理类型的应用,可以通过kubernetes Job资源对象来定义并启动一个批处理任务。批处理任务通常并行(或串行)启动多个计算进程去处理一批工作项(work item),处理完后,整个批处理任务结束。
5.1. 批处理的三种模式
批处理按任务实现方式不同分为以下几种模式:
- Job Template Expansion模式:一个Job对象对应一个待处理的Work item,有几个Work item就产生几个独立的Job,适用于Work item数量少、每个Work item要处理的数据量比较大的场景。例如有10个文件(Work item),每个文件(Work item)为100G。
- Queue with Pod Per Work Item模式:采用一个任务队列存放Work item,一个Job对象作为消费者去完成这些Work item,其中Job会启动N个Pod,每个Pod对应一个Work item。
- Queue with Variable Pod Count模式:采用一个任务队列存放Work item,一个Job对象作为消费者去完成这些Work item,其中Job会启动N个Pod,每个Pod对应一个Work item,但Pod的数量是可变的。
5.2. Job的三种类型
1)Non-parallel Jobs
通常一个Job只启动一个Pod,除非Pod异常才会重启该Pod,一旦此Pod正常结束,Job将结束。
2)Parallel Jobs with a fixed completion count
并行Job会启动多个Pod,此时需要设定Job的.spec.completions参数为一个正数,当正常结束的Pod数量达到该值则Job结束。
3)Parallel Jobs with a work queue
任务队列方式的并行Job需要一个独立的Queue,Work item都在一个Queue中存放,不能设置Job的.spec.completions参数。
此时Job的特性:
- 每个Pod能独立判断和决定是否还有任务项需要处理
- 如果某个Pod正常结束,则Job不会再启动新的Pod
- 如果一个Pod成功结束,则此时应该不存在其他Pod还在干活的情况,它们应该都处于即将结束、退出的状态
- 如果所有的Pod都结束了,且至少一个Pod成功结束,则整个Job算是成功结束
参考文章
- 《Kubernetes权威指南》
3.3.7 - Pod伸缩与升级
1. Pod伸缩
k8s中RC用来保持集群中始终运行指定数目的实例,通过RC的scale机制可以完成Pod的扩容和缩容(伸缩)。
1.1. 手动伸缩(scale)
kubectl scale rc redis-slave --replicas=3
1.2. 自动伸缩(HPA)
Horizontal Pod Autoscaler(HPA)控制器用于实现基于CPU使用率进行自动Pod伸缩的功能。HPA控制器基于Master的kube-controller-manager服务启动参数--horizontal-pod-autoscaler-sync-period定义的时长(默认30秒),周期性监控目标Pod的CPU使用率,并在满足条件时对ReplicationController或Deployment中的Pod副本数进行调整,以符合用户定义的平均Pod CPU使用率。Pod CPU使用率来源于heapster组件,因此需安装该组件。
可以通过kubectl autoscale命令进行快速创建或者使用yaml配置文件进行创建。创建之前需已存在一个RC或Deployment对象,并且该RC或Deployment中的Pod必须定义resources.requests.cpu的资源请求值,以便heapster采集到该Pod的CPU。
1.2.1. 通过kubectl autoscale创建
例如:
php-apache-rc.yaml
apiVersion: v1
kind: ReplicationController
metadata:
name: php-apache
spec:
replicas: 1
template:
metadata:
name: php-apache
labels:
app: php-apache
spec:
containers:
- name: php-apache
image: gcr.io/google_containers/hpa-example
resources:
requests:
cpu: 200m
ports:
- containerPort: 80
创建php-apache的RC
kubectl create -f php-apache-rc.yaml
php-apache-svc.yaml
apiVersion: v1
kind: Service
metadata:
name: php-apache
spec:
ports:
- port: 80
selector:
app: php-apache
创建php-apache的Service
kubectl create -f php-apache-svc.yaml
创建HPA控制器
kubectl autoscale rc php-apache --min=1 --max=10 --cpu-percent=50
1.2.2. 通过yaml配置文件创建
hpa-php-apache.yaml
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
name: php-apache
spec:
scaleTargetRef:
apiVersion: v1
kind: ReplicationController
name: php-apache
minReplicas: 1
maxReplicas: 10
targetCPUUtilizationPercentage: 50
创建hpa
kubectl create -f hpa-php-apache.yaml
查看hpa
kubectl get hpa
2. Pod滚动升级
k8s中的滚动升级通过执行kubectl rolling-update命令完成,该命令创建一个新的RC(与旧的RC在同一个命名空间中),然后自动控制旧的RC中的Pod副本数逐渐减少为0,同时新的RC中的Pod副本数从0逐渐增加到附加值,但滚动升级中Pod副本数(包括新Pod和旧Pod)保持原预期值。
2.1. 通过配置文件实现
redis-master-controller-v2.yaml
apiVersion: v1
kind: ReplicationController
metadata:
name: redis-master-v2
labels:
name: redis-master
version: v2
spec:
replicas: 1
selector:
name: redis-master
version: v2
template:
metadata:
labels:
name: redis-master
version: v2
spec:
containers:
- name: master
image: kubeguide/redis-master:2.0
ports:
- containerPort: 6371
注意事项:
- RC的名字(name)不能与旧RC的名字相同
- 在selector中应至少有一个Label与旧的RC的Label不同,以标识其为新的RC。例如本例中新增了version的Label。
运行kubectl rolling-update
kubectl rolling-update redis-master -f redis-master-controller-v2.yaml
2.2. 通过kubectl rolling-update命令实现
kubectl rolling-update redis-master --image=redis-master:2.0
与使用配置文件实现不同在于,该执行结果旧的RC被删除,新的RC仍使用旧的RC的名字。
2.3. 升级回滚
kubectl rolling-update加参数--rollback实现回滚操作
kubectl rolling-update redis-master --image=kubeguide/redis-master:2.0 --rollback
参考文章
- 《Kubernetes权威指南》
3.4 - 配置
3.4.1 - ConfigMap
Pod的配置管理
Kubernetes v1.2的版本提供统一的集群配置管理方案–ConfigMap。
1. ConfigMap:容器应用的配置管理
使用场景:
- 生成为容器内的环境变量。
- 设置容器启动命令的启动参数(需设置为环境变量)。
- 以Volume的形式挂载为容器内部的文件或目录。
ConfigMap以一个或多个key:value的形式保存在kubernetes系统中供应用使用,既可以表示一个变量的值(例如:apploglevel=info),也可以表示完整配置文件的内容(例如:server.xml=<?xml...>...)。
可以通过yaml配置文件或者使用kubectl create configmap命令的方式创建ConfigMap。
2. 创建ConfigMap
2.1. 通过yaml文件方式
cm-appvars.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: cm-appvars
data:
apploglevel: info
appdatadir: /var/data
常用命令
kubectl create -f cm-appvars.yaml
kubectl get configmap
kubectl describe configmap cm-appvars
kubectl get configmap cm-appvars -o yaml
2.2. 通过kubectl命令行方式
通过kubectl create configmap创建,使用参数--from-file或--from-literal指定内容,可以在一行中指定多个参数。
1)通过--from-file参数从文件中进行创建,可以指定key的名称,也可以在一个命令行中创建包含多个key的ConfigMap。
kubectl create configmap NAME --from-file=[key=]source --from-file=[key=]source
2)通过--from-file参数从目录中进行创建,该目录下的每个配置文件名被设置为key,文件内容被设置为value。
kubectl create configmap NAME --from-file=config-files-dir
3)通过--from-literal从文本中进行创建,直接将指定的key=value创建为ConfigMap的内容。
kubectl create configmap NAME --from-literal=key1=value1 --from-literal=key2=value2
容器应用对ConfigMap的使用有两种方法:
- 通过环境变量获取ConfigMap中的内容。
- 通过Volume挂载的方式将ConfigMap中的内容挂载为容器内部的文件或目录。
2.3. 通过环境变量的方式
ConfigMap的yaml文件:cm-appvars.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: cm-appvars
data:
apploglevel: info
appdatadir: /var/data
Pod的yaml文件:cm-test-pod.yaml
apiVersion: v1
kind: Pod
metadata:
name: cm-test-pod
spec:
containers:
- name: cm-test
image: busybox
command: ["/bin/sh","-c","env|grep APP"]
env:
- name: APPLOGLEVEL
valueFrom:
configMapKeyRef:
name: cm-appvars
key: apploglevel
- name: APPDATADIR
valueFrom:
configMapKeyRef:
name: cm-appvars
key: appdatadir
创建命令:
kubectl create -f cm-test-pod.yaml
kubectl get pods --show-all
kubectl logs cm-test-pod
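除环境变量方式外,也可以用Volume挂载方式使用同一个ConfigMap,挂载后每个key会成为目录下的一个文件。示意如下(Pod名称与挂载路径为假设值):
cat << EOF > cm-test-volume-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: cm-test-volume-pod
spec:
  containers:
  - name: cm-test
    image: busybox
    command: ["/bin/sh","-c","ls /etc/config && sleep 3600"]
    volumeMounts:
    - name: appvars
      mountPath: /etc/config
  volumes:
  - name: appvars
    configMap:
      name: cm-appvars
EOF
kubectl create -f cm-test-volume-pod.yaml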
3. 使用ConfigMap的限制条件
- ConfigMap必须在Pod之前创建
- ConfigMap也可以定义为属于某个Namespace。只有处于相同Namespace中的Pod可以引用它。
- kubelet只支持可以被API Server管理的Pod使用ConfigMap。静态Pod无法引用。
- 在Pod对ConfigMap进行挂载操作时,容器内只能挂载为“目录”,无法挂载为文件。
参考文章
- 《Kubernetes权威指南》
4 - 核心原理
4.1 - 核心组件
4.1.1 - Kubernetes核心原理(一)之API Server
1. API Server简介
k8s API Server提供了k8s各类资源对象(pod,RC,Service等)的增删改查及watch等HTTP Rest接口,是整个系统的数据总线和数据中心。
kubernetes API Server的功能:
- 提供了集群管理的REST API接口(包括认证授权、数据校验以及集群状态变更);
- 提供其他模块之间的数据交互和通信的枢纽(其他模块通过API Server查询或修改数据,只有API Server才直接操作etcd);
- 是资源配额控制的入口;
- 拥有完备的集群安全机制.
kube-apiserver工作原理图
2. 如何访问kubernetes API
k8s通过kube-apiserver这个进程提供服务,该进程运行在单个k8s-master节点上。默认有两个端口。
2.1. 本地端口
- 该端口用于接收HTTP请求;
- 该端口默认值为8080,可以通过API Server的启动参数“--insecure-port”的值来修改默认值;
- 默认的IP地址为“localhost”,可以通过启动参数“--insecure-bind-address”的值来修改该IP地址;
- 非认证或授权的HTTP请求通过该端口访问API Server。
2.2. 安全端口
- 该端口默认值为6443,可通过启动参数“--secure-port”的值来修改默认值;
- 默认IP地址为非本地(Non-Localhost)网络端口,通过启动参数“--bind-address”设置该值;
- 该端口用于接收HTTPS请求;
- 用于基于Token文件或客户端证书及HTTP Base的认证;
- 用于基于策略的授权;
- 默认不启动HTTPS安全访问控制。
2.3. 访问方式
Kubernetes REST API可参考https://kubernetes.io/docs/api-reference/v1.6/
2.3.1. curl
curl localhost:8080/api
curl localhost:8080/api/v1/pods
curl localhost:8080/api/v1/services
curl localhost:8080/api/v1/replicationcontrollers
2.3.2. Kubectl Proxy
Kubectl Proxy代理程序既能作为API Server的反向代理,也能作为普通客户端访问API Server的代理。通过master节点的8080端口来启动该代理程序。
kubectl proxy --port=8080 &
具体见kubectl proxy --help
[root@node5 ~]# kubectl proxy --help
To proxy all of the kubernetes api and nothing else, use:
kubectl proxy --api-prefix=/
To proxy only part of the kubernetes api and also some static files:
kubectl proxy --www=/my/files --www-prefix=/static/ --api-prefix=/api/
The above lets you 'curl localhost:8001/api/v1/pods'.
To proxy the entire kubernetes api at a different root, use:
kubectl proxy --api-prefix=/custom/
The above lets you 'curl localhost:8001/custom/api/v1/pods'
Usage:
kubectl proxy [--port=PORT] [--www=static-dir] [--www-prefix=prefix] [--api-prefix=prefix] [flags]
Examples:
# Run a proxy to kubernetes apiserver on port 8011, serving static content from ./local/www/
$ kubectl proxy --port=8011 --www=./local/www/
# Run a proxy to kubernetes apiserver on an arbitrary local port.
# The chosen port for the server will be output to stdout.
$ kubectl proxy --port=0
# Run a proxy to kubernetes apiserver, changing the api prefix to k8s-api
# This makes e.g. the pods api available at localhost:8011/k8s-api/v1/pods/
$ kubectl proxy --api-prefix=/k8s-api
Flags:
--accept-hosts="^localhost$,^127\.0\.0\.1$,^\[::1\]$": Regular expression for hosts that the proxy should accept.
--accept-paths="^/.*": Regular expression for paths that the proxy should accept.
--api-prefix="/": Prefix to serve the proxied API under.
--disable-filter[=false]: If true, disable request filtering in the proxy. This is dangerous, and can leave you vulnerable to XSRF attacks, when used with an accessible port.
-p, --port=8001: The port on which to run the proxy. Set to 0 to pick a random port.
--reject-methods="POST,PUT,PATCH": Regular expression for HTTP methods that the proxy should reject.
--reject-paths="^/api/.*/exec,^/api/.*/run": Regular expression for paths that the proxy should reject.
-u, --unix-socket="": Unix socket on which to run the proxy.
-w, --www="": Also serve static files from the given directory under the specified prefix.
-P, --www-prefix="/static/": Prefix to serve static files under, if static file directory is specified.
Global Flags:
--alsologtostderr[=false]: log to standard error as well as files
--api-version="": The API version to use when talking to the server
--certificate-authority="": Path to a cert. file for the certificate authority.
--client-certificate="": Path to a client key file for TLS.
--client-key="": Path to a client key file for TLS.
--cluster="": The name of the kubeconfig cluster to use
--context="": The name of the kubeconfig context to use
--insecure-skip-tls-verify[=false]: If true, the server's certificate will not be checked for validity. This will make your HTTPS connections insecure.
--kubeconfig="": Path to the kubeconfig file to use for CLI requests.
--log-backtrace-at=:0: when logging hits line file:N, emit a stack trace
--log-dir="": If non-empty, write log files in this directory
--log-flush-frequency=5s: Maximum number of seconds between log flushes
--logtostderr[=true]: log to standard error instead of files
--match-server-version[=false]: Require server version to match client version
--namespace="": If present, the namespace scope for this CLI request.
--password="": Password for basic authentication to the API server.
-s, --server="": The address and port of the Kubernetes API server
--stderrthreshold=2: logs at or above this threshold go to stderr
--token="": Bearer token for authentication to the API server.
--user="": The name of the kubeconfig user to use
--username="": Username for basic authentication to the API server.
--v=0: log level for V logs
--vmodule=: comma-separated list of pattern=N settings for file-filtered logging
2.3.3. kubectl客户端
命令行工具kubectl客户端,通过命令行参数转换为对API Server的REST API调用,并将调用结果输出。
命令格式:kubectl [command] [options]
具体可参考k8s常用命令
2.3.4. 编程方式调用
使用场景:
1、运行在Pod里的用户进程调用kubernetes API,通常用来实现分布式集群搭建的目标。
2、开发基于kubernetes的管理平台,比如调用kubernetes API来完成Pod、Service、RC等资源对象的图形化创建和管理界面。可以使用kubernetes提供的Client Library。
具体可参考https://github.com/kubernetes/client-go。
3. 通过API Server访问Node、Pod和Service
k8s API Server最主要的REST接口是资源对象的增删改查,另外还有一类特殊的REST接口—k8s Proxy API接口,这类接口的作用是代理REST请求,即kubernetes API Server把收到的REST请求转发到某个Node上的kubelet守护进程的REST端口上,由该kubelet进程负责响应。
3.1. Node相关接口
关于Node相关的接口的REST路径为:/api/v1/proxy/nodes/{name},其中{name}为节点的名称或IP地址。
/api/v1/proxy/nodes/{name}/pods/ #列出指定节点内所有Pod的信息
/api/v1/proxy/nodes/{name}/stats/ #列出指定节点内物理资源的统计信息
/api/v1/proxy/nodes/{name}/spec/ #列出指定节点的概要信息
这里获取的Pod信息来自Node而非etcd数据库,两者时间点可能存在偏差。如果在kubelet进程启动时加--enable-debugging-handlers=true参数,那么kubernetes Proxy API还会增加以下接口:
/api/v1/proxy/nodes/{name}/run #在节点上运行某个容器
/api/v1/proxy/nodes/{name}/exec #在节点上的某个容器中运行某条命令
/api/v1/proxy/nodes/{name}/attach #在节点上attach某个容器
/api/v1/proxy/nodes/{name}/portForward #实现节点上的Pod端口转发
/api/v1/proxy/nodes/{name}/logs #列出节点的各类日志信息
/api/v1/proxy/nodes/{name}/metrics #列出和该节点相关的Metrics信息
/api/v1/proxy/nodes/{name}/runningpods #列出节点内运行中的Pod信息
/api/v1/proxy/nodes/{name}/debug/pprof #列出节点内当前web服务的状态,包括CPU和内存的使用情况
3.2. Pod相关接口
/api/v1/proxy/namespaces/{namespace}/pods/{name}/{path:*} #访问pod的某个服务接口
/api/v1/proxy/namespaces/{namespace}/pods/{name} #访问Pod
#以下写法不同,功能一样
/api/v1/namespaces/{namespace}/pods/{name}/proxy/{path:*} #访问pod的某个服务接口
/api/v1/namespaces/{namespace}/pods/{name}/proxy #访问Pod
3.3. Service相关接口
/api/v1/proxy/namespaces/{namespace}/services/{name}
Pod的proxy接口的作用:在kubernetes集群之外访问某个pod容器的服务(HTTP服务),可以用Proxy API实现,这种场景多用于管理目的,比如逐一排查Service的Pod副本,检查哪些Pod的服务存在异常问题。
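例如,可以经由 API Server 的代理访问某个 Pod 提供的 HTTP 服务(namespace、Pod名称与路径均为假设值,且假设已按前文方式在本地 8080 端口开启代理):
curl http://localhost:8080/api/v1/namespaces/default/pods/my-pod/proxy/
curl http://localhost:8080/api/v1/namespaces/default/pods/my-pod/proxy/healthz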
4. 集群功能模块之间的通信
kubernetes API Server作为集群的核心,负责集群各功能模块之间的通信,集群内各个功能模块通过API Server将信息存入etcd,当需要获取和操作这些数据时,通过API Server提供的REST接口(GET/LIST/WATCH方法)来实现,从而实现各模块之间的信息交互。
4.1. kubelet与API Server交互
每个Node节点上的kubelet定期就会调用API Server的REST接口报告自身状态,API Server接收这些信息后,将节点状态信息更新到etcd中。kubelet也通过API Server的Watch接口监听Pod信息,从而对Node机器上的POD进行管理。
| 监听信息 | kubelet动作 |
|---|---|
| 新的POD副本被调度绑定到本节点 | 执行POD对应的容器的创建和启动逻辑 |
| POD对象被删除 | 删除本节点上相应的POD容器 |
| 修改POD信息 | 修改本节点的POD容器 |
4.2. kube-controller-manager与API Server交互
kube-controller-manager中的Node Controller模块通过API Server提供的Watch接口,实时监控Node的信息,并做相应处理。
4.3. kube-scheduler与API Server交互
Scheduler通过API Server的Watch接口监听到新建Pod副本的信息后,它会检索所有符合该Pod要求的Node列表,开始执行Pod调度逻辑。调度成功后将Pod绑定到目标节点上。
4.4. 特别说明
为了缓解各模块对API Server的访问压力,各功能模块都采用缓存机制来缓存数据,各功能模块定时从API Server获取指定的资源对象信息(LIST/WATCH方法),然后将信息保存到本地缓存,功能模块在某些情况下不直接访问API Server,而是通过访问缓存数据来间接访问API Server。
参考《kubernetes权威指南》
4.1.2 - Kubernetes核心原理(二)之Controller Manager
1. Controller Manager简介
Controller Manager作为集群内部的管理控制中心,负责集群内的Node、Pod副本、服务端点(Endpoint)、命名空间(Namespace)、服务账号(ServiceAccount)、资源定额(ResourceQuota)的管理,当某个Node意外宕机时,Controller Manager会及时发现并执行自动化修复流程,确保集群始终处于预期的工作状态。
每个Controller通过API Server提供的接口实时监控整个集群的每个资源对象的当前状态,当发生各种故障导致系统状态发生变化时,会尝试将系统状态修复到“期望状态”。
2. Replication Controller
为了区分,将资源对象Replication Controller简称RC,而本文中是指Controller Manager中的Replication Controller,称为副本控制器。副本控制器的作用即保证集群中一个RC所关联的Pod副本数始终保持预设值。
- 只有当Pod的重启策略是Always的时候(RestartPolicy=Always),副本控制器才会管理该Pod的操作(创建、销毁、重启等)。
- RC中的Pod模板就像一个模具,模具制造出来的东西一旦离开模具,它们之间就再没关系了。一旦Pod被创建,无论模板如何变化,也不会影响到已经创建的Pod。
- Pod可以通过修改label来脱离RC的管控,该方法可以用于将Pod从集群中迁移,数据修复等调试。
- 删除一个RC不会影响它所创建的Pod,如果要删除Pod需要将RC的副本数属性设置为0。
- 不要越过RC创建Pod,因为RC可以实现自动化控制Pod,提高容灾能力。
2.1. Replication Controller的职责
- 确保集群中有且仅有N个Pod实例,N是RC中定义的Pod副本数量。
- 通过调整RC中的spec.replicas属性值来实现系统扩容或缩容。
- 通过改变RC中的Pod模板来实现系统的滚动升级。
2.2. Replication Controller使用场景
| 使用场景 | 说明 | 使用命令 |
|---|---|---|
| 重新调度 | 当发生节点故障或Pod被意外终止运行时,可以重新调度保证集群中仍然运行指定的副本数。 | |
| 弹性伸缩 | 通过手动或自动扩容代理修改副本控制器的spec.replicas属性值,可以实现弹性伸缩。 | kubectl scale |
| 滚动更新 | 创建一个新的RC文件,通过kubectl 命令或API执行,则会新增一个新的副本同时删除旧的副本,当旧副本为0时,删除旧的RC。 | kubectl rolling-update |
滚动升级,具体可参考kubectl rolling-update --help,官方文档:https://kubernetes.io/docs/tasks/run-application/rolling-update-replication-controller/
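对应上表的两类操作,可以参考下面的命令示例(RC名称、镜像均为假设值;rolling-update只适用于较老版本的kubectl,新版本建议改用Deployment):
# 将名为frontend的RC扩容到5个副本
kubectl scale rc frontend --replicas=5
# 基于新的镜像对RC做滚动升级
kubectl rolling-update frontend --image=nginx:1.15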
3. Node Controller
kubelet在启动时会通过API Server注册自身的节点信息,并定时向API Server汇报状态信息,API Server接收到信息后将信息更新到etcd中。
Node Controller通过API Server实时获取Node的相关信息,实现管理和监控集群中的各个Node节点的相关控制功能。流程如下

1、Controller Manager在启动时如果设置了--cluster-cidr参数,那么为每个没有设置Spec.PodCIDR的Node节点生成一个CIDR地址,并用该CIDR地址设置节点的Spec.PodCIDR属性,防止不同的节点的CIDR地址发生冲突。
2、具体流程见以上流程图。
3、逐个读取节点信息,如果节点状态变成非“就绪”状态,则将节点加入待删除队列,否则将节点从该队列删除。
4. ResourceQuota Controller
资源配额管理确保指定的资源对象在任何时候都不会超量占用系统物理资源。
支持三个层次的资源配置管理:
1)容器级别:对CPU和Memory进行限制
2)Pod级别:对一个Pod内所有容器的可用资源进行限制
3)Namespace级别:包括
- Pod数量
- Replication Controller数量
- Service数量
- ResourceQuota数量
- Secret数量
- 可持有的PV(Persistent Volume)数量
说明:
- k8s配额管理是通过Admission Control(准入控制)来控制的;
- Admission Control提供两种配额约束方式:LimitRanger和ResourceQuota;
- LimitRanger作用于Pod和Container;
- ResourceQuota作用于Namespace上,限定一个Namespace里的各类资源的使用总额。
ResourceQuota Controller流程图:

5. Namespace Controller
用户通过API Server可以创建新的Namespace并保存在etcd中,Namespace Controller定时通过API Server读取这些Namespace信息。
如果Namespace被API标记为优雅删除(即设置删除期限,DeletionTimestamp),则将该Namespace状态设置为“Terminating”,并保存到etcd中。同时Namespace Controller删除该Namespace下的ServiceAccount、RC、Pod等资源对象。
6. Endpoint Controller
Service、Endpoint、Pod的关系:

Endpoints表示一个Service对应的所有Pod副本的访问地址,而Endpoints Controller是负责生成和维护所有Endpoints对象的控制器,它负责监听Service和对应的Pod副本的变化。
- 如果监测到Service被删除,则删除和该Service同名的Endpoints对象;
- 如果监测到新的Service被创建或修改,则根据该Service信息获得相关的Pod列表,然后创建或更新Service对应的Endpoints对象。
- 如果监测到Pod的事件,则更新它对应的Service的Endpoints对象。
kube-proxy进程获取每个Service的Endpoints,实现Service的负载均衡功能。
7. Service Controller
Service Controller是kubernetes集群与外部云平台之间的一个接口控制器。Service Controller监听Service的变化,如果是一个LoadBalancer类型的Service,则确保外部云平台上该Service对应的LoadBalancer实例被相应地创建、删除以及更新路由转发表。
参考《Kubernetes权威指南》
4.1.3 - Kubernetes核心原理(三)之Scheduler
1. Scheduler简介
Scheduler负责Pod调度。在整个系统中起"承上启下"作用,承上:负责接收Controller Manager创建的新的Pod,为其选择一个合适的Node;启下:Node上的kubelet接管Pod的生命周期。
Scheduler:
1)通过调度算法为待调度Pod列表的每个Pod从Node列表中选择一个最适合的Node,并将信息写入etcd中
2)kubelet通过API Server监听到kubernetes Scheduler产生的Pod绑定信息,然后获取对应的Pod清单,下载Image,并启动容器。

2. 调度流程
1、预选调度过程,即遍历所有目标Node,筛选出符合要求的候选节点,kubernetes内置了多种预选策略(xxx Predicates)供用户选择
2、确定最优节点,在第一步的基础上采用优选策略(xxx Priority)计算出每个候选节点的积分,取最高积分。
调度流程通过插件式加载的“调度算法提供者”(AlgorithmProvider)具体实现,一个调度算法提供者就是包括一组预选策略与一组优选策略的结构体。
3. 预选策略
说明:返回true表示该节点满足该Pod的调度条件;返回false表示该节点不满足该Pod的调度条件。
3.1. NoDiskConflict
判断备选Pod的数据卷是否与该Node上已存在Pod挂载的数据卷冲突,如果是则返回false,否则返回true。
3.2. PodFitsResources
判断备选节点的资源是否满足备选Pod的需求,即节点的剩余资源满不满足该Pod的资源使用。
- 计算备选Pod和节点中已用资源(该节点所有Pod的使用资源)的总和。
- 获取备选节点的状态信息,包括节点资源信息。
- 如果(备选Pod+节点已用资源>该节点总资源)则返回false,即剩余资源不满足该Pod使用;否则返回true。
3.3. PodSelectorMatches
判断节点是否包含备选Pod的标签选择器指定的标签,即通过标签来选择Node。
- 如果Pod中没有指定spec.nodeSelector,则返回true。
- 否则获得备选节点的标签信息,判断该节点的标签信息中是否包含该Pod的spec.nodeSelector中指定的标签,如果包含返回true,否则返回false。
3.4. PodFitsHost
判断备选Pod的spec.nodeName所指定的节点名称与备选节点名称是否一致,如果一致返回true,否则返回false。
3.5. CheckNodeLabelPresence
检查备选节点中是否有Scheduler配置的标签,如果有返回true,否则返回false。
3.6. CheckServiceAffinity
判断备选节点是否包含Scheduler配置的标签,如果有返回true,否则返回false。
3.7. PodFitsPorts
判断备选Pod所用的端口列表中的端口是否在备选节点中已被占用,如果被占用返回false,否则返回true。
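下面的Pod定义片段可以帮助理解PodSelectorMatches和PodFitsHost这两个策略所依据的字段(标签、节点名均为假设值,仅作示例):
cat <<EOF | kubectl create -f -
apiVersion: v1
kind: Pod
metadata:
  name: sched-demo
spec:
  # PodSelectorMatches:只有带有disktype=ssd标签的节点才能通过预选
  nodeSelector:
    disktype: ssd
  # PodFitsHost:如果指定nodeName,则只有同名节点才能通过预选
  # nodeName: node-1
  containers:
  - name: nginx
    image: nginx
EOF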
4. 优选策略
4.1. LeastRequestedPriority
优先从备选节点列表中选择资源消耗最小的节点(CPU+内存)。
4.2. CalculateNodeLabelPriority
优先选择含有指定Label的节点。
4.3. BalancedResourceAllocation
优先从备选节点列表中选择各项资源使用率最均衡的节点。
参考《Kubernetes权威指南》
4.1.4 - Kubernetes核心原理(四)之kubelet
1. kubelet简介
在kubernetes集群中,每个Node节点都会启动kubelet进程,用来处理Master节点下发到本节点的任务,管理Pod和其中的容器。kubelet会在API Server上注册节点信息,定期向Master汇报节点资源使用情况,并通过cAdvisor监控容器和节点资源。可以把kubelet理解成【Server-Agent】架构中的agent,是Node上的pod管家。
更多kubelet配置参数信息可参考kubelet --help
2. 节点管理
节点通过设置kubelet的启动参数“--register-node”,来决定是否向API Server注册自己,默认为true。可以通过kubelet --help或者查看kubernetes源码【cmd/kubelet/app/server.go中】来查看该参数。
kubelet的配置文件
默认配置文件在/etc/kubernetes/kubelet中,其中
- --api-servers:用来配置Master节点的IP和端口。
- --kubeconfig:用来配置kubeconfig的路径,kubeconfig文件常用来指定证书。
- --hostname-override:用来配置该节点在集群中显示的主机名。
- --node-status-update-frequency:配置kubelet向Master心跳上报的频率,默认为10s。
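一个使用上述参数的kubelet启动命令示例如下(IP、主机名为假设值,不同版本kubelet的参数名可能有所差异,仅作参考):
kubelet --api-servers=http://192.168.1.10:8080 \
        --kubeconfig=/etc/kubernetes/kubelet.kubeconfig \
        --hostname-override=node-1 \
        --node-status-update-frequency=10s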
3. Pod管理
kubelet有几种方式获取自身Node上所需要运行的Pod清单。但本文只讨论通过API Server监听etcd目录,同步Pod列表的方式。
kubelet通过API Server Client使用WatchAndList的方式监听etcd中/registry/nodes/${当前节点名称}和/registry/pods的目录,将获取的信息同步到本地缓存中。
kubelet监听etcd,执行对Pod的操作,对容器的操作则是通过Docker Client执行,例如启动删除容器等。
kubelet创建和修改Pod流程:
- 为该Pod创建一个数据目录。
- 从API Server读取该Pod清单。
- 为该Pod挂载外部卷(External Volume)
- 下载Pod用到的Secret。
- 检查运行的Pod,执行Pod中未完成的任务。
- 先创建一个Pause容器,该容器接管Pod的网络,再创建其他容器。
- Pod中容器的处理流程: 1)比较容器hash值并做相应处理。 2)如果容器被终止了且没有指定重启策略,则不做任何处理。 3)调用Docker Client下载容器镜像,调用Docker Client运行容器。
4. 容器健康检查
Pod通过探针的方式来检查容器的健康状态,具体可参考Pod详解#Pod健康检查。
5. cAdvisor资源监控
kubelet通过cAdvisor获取本节点信息及容器的数据。cAdvisor为谷歌开源的容器资源分析工具,默认集成到kubernetes中。
cAdvisor自动采集CPU,内存,文件系统,网络使用情况,容器中运行的进程,默认端口为4194。可以通过Node IP+Port访问。
更多参考:http://github.com/google/cadvisor
参考《Kubernetes权威指南》
4.2 - 流程图
4.2.1 - Pod创建流程
Pod创建基本流程图
Pod创建完整流程图
图片来源:https://fuckcloudnative.io/posts/what-happens-when-k8s/
参考:
4.2.2 - PVC创建流程
pvc流程
流程如下:
- 用户创建了一个包含 PVC 的 Pod,该 PVC 要求使用动态存储卷;
- Scheduler 根据 Pod 配置、节点状态、PV 配置等信息,把 Pod 调度到一个合适的 Worker 节点上;
- PV 控制器 watch 到该 Pod 使用的 PVC 处于 Pending 状态,于是调用 Volume Plugin(in-tree)创建存储卷,并创建 PV 对象(out-of-tree 由 External Provisioner 来处理);
- AD 控制器发现 Pod 和 PVC 处于待挂接状态,于是调用 Volume Plugin 挂接存储设备到目标 Worker 节点上
- 在 Worker 节点上,Kubelet 中的 Volume Manager 等待存储设备挂接完成,并通过 Volume Plugin 将设备挂载到全局目录:/var/lib/kubelet/pods/[pod uid]/volumes/kubernetes.io~iscsi/[PVname](以 iscsi 为例);
- Kubelet 通过 Docker 启动 Pod 的 Containers,用 bind mount 方式将已挂载到本地全局目录的卷映射到容器中。
详细流程图
5 - 容器网络
5.1 - Docker网络
1. Docker的网络基础
1.1. Network Namespace
不同的网络命名空间中,协议栈是独立的,完全隔离,彼此之间无法通信。每个网络命名空间都有独立的路由表和独立的Iptables/Netfilter来提供包的转发、NAT、IP包过滤等功能。
1.1.1. 网络命名空间的实现
将与网络协议栈相关的全局变量变成一个Net Namespace变量的成员,然后在调用协议栈函数时加入一个Namespace参数。
1.1.2. 网络命名空间的操作
1、创建网络命名空间
ip netns add name
2、在命名空间内执行命令
ip netns exec name command
3、进入命名空间
ip netns exec name bash
2. Docker的网络实现
2.1. 容器网络
Docker使用Linux桥接,在宿主机虚拟一个Docker容器网桥(docker0),Docker启动一个容器时会根据Docker网桥的网段分配给容器一个IP地址,称为Container-IP,同时Docker网桥是每个容器的默认网关。因为在同一宿主机内的容器都接入同一个网桥,这样容器之间就能够通过容器的Container-IP直接通信。
Docker网桥是宿主机虚拟出来的,并不是真实存在的网络设备,外部网络无法寻址到它,这也意味着外部网络无法直接通过Container-IP访问到容器。如果希望外部网络能够访问到容器,可以将容器端口映射到宿主机(端口映射),即在docker run创建容器时通过 -p 或 -P 参数来启用,之后通过[宿主机IP]:[容器端口]访问容器。

2.2. 4类网络模式
| Docker网络模式 | 配置 | 说明 |
|---|---|---|
| host模式 | --net=host | 容器和宿主机共享Network namespace。 |
| container模式 | --net=container:NAME_or_ID | 容器和另外一个容器共享Network namespace。 kubernetes中的pod就是多个容器共享一个Network namespace。 |
| none模式 | --net=none | 容器有独立的Network namespace,但并没有对其进行任何网络设置,如分配veth pair 和网桥连接,配置IP等。 |
| bridge模式 | --net=bridge(默认为该模式) | 桥接模式 |
3. Docker网络模式
3.1. bridge桥接模式
在bridge模式下,Docker可以使用独立的网络栈。实现方式是父进程在创建子进程的时候通过传入CLONE_NEWNET的参数创建出一个网络命名空间。
实现步骤:
- Docker Daemon首次启动时会创建一个虚拟网桥docker0,地址通常为172.x.x.x开头,在私有的网络空间中给这个网络分配一个子网。
- 由Docker创建处理的每个容器,都会创建一个虚拟以太设备对(veth pair),一端关联到网桥,另一端使用Namespace技术映射到容器内的eth0设备,然后从网桥的地址段内给eth0接口分配一个IP地址。

一般情况,宿主机IP与docker0 IP、容器IP是不同的IP段,默认情况,外部看不到docker0和容器IP,对于外部来说相当于docker0和容器的IP为内网IP。
3.1.1. 外部网络访问Docker容器
外部访问docker容器可以通过端口映射(NAT)的方式,Docker使用NAT的方式将容器内部的服务与宿主机的某个端口port_1绑定。
外部访问容器的流程如下:
- 外界网络通过宿主机的IP和映射的端口port_1访问。
- 当宿主机收到此类请求,会通过DNAT将请求的目标IP即宿主机IP和目标端口即映射端口port_1替换成容器的IP和容器的端口port_0。
- 由于宿主机上可以识别容器IP,所以宿主机将请求发给veth pair。
- veth pair将请求发送给容器内部的eth0,由容器内部的服务进行处理。
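可以用下面的命令直观地观察这一过程(端口、镜像为示例值):
# 将宿主机8080端口映射到容器的80端口
docker run -d --name web -p 8080:80 nginx
# 查看Docker在nat表中生成的DNAT规则:目的端口8080被转换为容器IP的80端口
iptables -t nat -L DOCKER -n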
3.1.2. Docker容器访问外部网络
docker容器访问外部网络的流程:
- docker容器向外部目标IP和目标端口port_2发起请求,请求报文中的源IP为容器IP。
- 请求通过容器内部的eth0到veth pair的另一端docker0网桥。
- docker0网桥通过数据报转发功能将请求转发到宿主机的eth0。
- 宿主机处理请求时通过SNAT将请求中的源IP换成宿主机eth0的IP。
- 处理后的报文通过请求的目标IP发送到外部网络。
3.1.3. 缺点
使用NAT的方式可能会带来性能的问题,影响网络传输效率。
3.2. host模式
host模式并没有给容器创建一个隔离的网络环境,而是和宿主机共用一个网络命名空间,容器使用宿主机的eth0和外界进行通信,同样容器也共用宿主机的端口资源,即分配端口可能存在与宿主机已分配的端口冲突的问题。
实现的方式即父进程在创建子进程的时候不传入CLONE_NEWNET的参数,从而和宿主机共享一个网络空间。
host模式没有通过NAT的方式进行转发因此性能上相对较好,但是不存在网络隔离性,可能产生端口冲突的问题。
3.3. container模式
container模式即docker容器可以使用其他容器的网络命名空间,即和其他容器处于同一个网络命名空间。
步骤:
- 查找其他容器的网络命名空间。
- 新创建的容器的网络命名空间使用其他容器的网络命名空间。
通过和其他容器共享网络命名空间的方式,可以让不同的容器之间处于相同的网络命名空间,可以直接通过localhost的方式进行通信,简化了强关联的多个容器之间的通信问题。
k8s中的pod的概念就是通过一组容器共享一个网络命名空间来达到pod内部的不同容器可以直接通过localhost的方式进行通信。
3.4. none模式
none模式即不为容器创建任何的网络环境,用户可以根据自己的需要手动去创建不同的网络定制配置。
参考:
- 《Docker源码分析》
5.2 - K8S网络
1. kubernetes网络模型
1.1. 基础原则
- 每个Pod都拥有一个独立的IP地址,而且假定所有Pod都在一个可以直接连通的、扁平的网络空间中,不管是否运行在同一Node上都可以通过Pod的IP来访问。
- k8s中Pod的IP是最小粒度IP。同一个Pod内所有的容器共享一个网络堆栈,该模型称为IP-per-Pod模型。
- Pod的IP是由docker0实际分配的,Pod内部看到的IP地址和端口与外部保持一致。同一个Pod内的不同容器共享网络,可以通过localhost来访问对方的端口,类似同一个VM内的不同进程。
- IP-per-Pod模型从端口分配、域名解析、服务发现、负载均衡、应用配置等角度看,Pod可以看作是一台独立的VM或物理机。
1.2. k8s对集群的网络要求
- 所有容器都可以不用NAT的方式同别的容器通信。
- 所有节点都可以在不用NAT的方式下同所有容器通信,反之亦然。
- 容器的地址和别人看到的地址是同一个地址。
以上的集群网络要求可以通过第三方开源方案实现,例如flannel。
1.3. 网络架构图

1.4. k8s集群IP概念汇总
由集群外部到集群内部:
| IP类型 | 说明 |
|---|---|
| Proxy-IP | 代理层公网地址IP,外部访问应用的网关服务器。[实际需要关注的IP] |
| Service-IP | Service的固定虚拟IP,Service-IP是内部,外部无法寻址到。 |
| Node-IP | 容器宿主机的主机IP。 |
| Container-Bridge-IP | 容器网桥(docker0)IP,容器的网络都需要通过容器网桥转发。 |
| Pod-IP | Pod的IP,等效于Pod中网络容器的Container-IP。 |
| Container-IP | 容器的IP,容器的网络是个隔离的网络空间。 |
2. kubernetes的网络实现
k8s网络场景
- 容器与容器之间的直接通信。
- Pod与Pod之间的通信。
- Pod到Service之间的通信。
- 集群外部与内部组件之间的通信。
2.1. Pod网络
Pod作为kubernetes的最小调度单元,Pod是容器的集合,是一个逻辑概念,Pod包含的容器都运行在同一个宿主机上,这些容器将拥有同样的网络空间,容器之间能够互相通信,它们能够在本地访问其它容器的端口。 实际上Pod都包含一个网络容器,它不做任何事情,只是用来接管Pod的网络,业务容器通过加入网络容器的网络从而实现网络共享。Pod网络本质上还是容器网络,所以Pod-IP就是网络容器的Container-IP。
一般将容器云平台的网络模型打造成一个扁平化网络平面,在这个网络平面内,Pod作为一个网络单元同Kubernetes Node的网络处于同一层级。
2.2. Pod内部容器之间的通信
同一个Pod之间的不同容器因为共享同一个网络命名空间,所以可以直接通过localhost直接通信。
2.3. Pod之间的通信
2.3.1. 同Node的Pod之间的通信
同一个Node内,不同的Pod都有一个全局IP,可以直接通过Pod的IP进行通信。Pod地址和docker0在同一个网段。
在pause容器启动之前,会创建一个虚拟以太网接口对(veth pair),该接口对一端连着容器内部的eth0 ,一端连着容器外部的vethxxx,vethxxx会绑定到容器运行时配置使用的网桥bridge0上,从该网络的IP段中分配IP给容器的eth0。
当同节点上的Pod-A发包给Pod-B时,包传送路线如下:
pod-a的eth0—>pod-a的vethxxx—>bridge0—>pod-b的vethxxx—>pod-b的eth0
因为相同节点的bridge0是相通的,因此可以通过bridge0来完成不同pod直接的通信,但是不同节点的bridge0是不通的,因此不同节点的pod之间的通信需要将不同节点的bridge0给连接起来。
2.3.2. 不同Node的Pod之间的通信
不同的Node之间,Node的IP相当于外网IP,可以直接访问,而Node内的docker0和Pod的IP则是内网IP,无法直接跨Node访问。需要通过Node的网卡进行转发。
所以不同Node之间的通信需要达到两个条件:
- 对整个集群中的Pod-IP分配进行规划,不能有冲突(可以通过第三方开源工具来管理,例如flannel)。
- 将Node-IP与该Node上的Pod-IP关联起来,通过Node-IP再转发到Pod-IP。
不同节点的Pod之间的通信需要将不同节点的bridge0给连接起来。连接不同节点的bridge0的方式有好几种,主要有overlay和underlay,或常规的三层路由。
不同节点的bridge0需要不同的IP段,保证Pod IP分配不会冲突,节点的物理网卡eth0也要和该节点的网桥bridge0连接。因此,节点a上的pod-a发包给节点b上的pod-b,路线如下:
节点a上的pod-a的eth0—>pod-a的vethxxx—>节点a的bridge0—>节点a的eth0—>
节点b的eth0—>节点b的bridge0—>pod-b的vethxxx—>pod-b的eth0

1. Pod间实现通信
例如:Pod1和Pod2(同主机),Pod1和Pod3(跨主机)能够通信
实现:因为Pod的Pod-IP是Docker网桥分配的,Pod-IP是同Node下全局唯一的。所以将不同Kubernetes Node的 Docker网桥配置成不同的IP网段即可。
2. Node与Pod间实现通信
例如:Node1和Pod1/ Pod2(同主机),Pod3(跨主机)能够通信
实现:在容器集群中创建一个覆盖网络(Overlay Network),联通各个节点,目前可以通过第三方网络插件来创建覆盖网络,比如Flannel和Open vSwitch等。
不同节点间的Pod访问也可以通过calico形成的Pod IP的路由表来解决。
2.4. Service网络
Service就是在Pod之间起到服务代理的作用,对外表现为一个单一的访问接口,将请求转发给Pod。Service的网络转发是Kubernetes实现服务编排的关键一环。每个Service都会生成一个虚拟IP,称为Service-IP,由Kubernetes Proxy组件负责实现Service-IP的路由和转发,即在容器覆盖网络之上又实现了一层虚拟转发网络。
Kubernetes Proxy实现了以下功能:
- 转发访问Service的Service-IP的请求到Endpoints(即Pod-IP)。
- 监控Service和Endpoints的变化,实时刷新转发规则。
- 负载均衡能力。
3. 开源的网络组件
3.1. Flannel
具体参考Flannel介绍
参考《Kubernetes权威指南》
5.3 - CNI
5.3.1 - CNI接口介绍
1. CNI(Container Network Interface)
CNI(Container Network Interface)即容器网络接口,通过约定统一的容器网络接口,从而kubelet可以通过这个标准的API来调用不同的网络插件实现不同的网络功能。
kubelet通过启动参数--network-plugin=cni来指定使用CNI插件,并从--cni-conf-dir(默认是 /etc/cni/net.d)读取配置文件,使用该文件中的CNI配置来设置各个Pod的网络。CNI配置文件必须与CNI规约匹配,并且配置所引用的所有CNI插件都应存在于--cni-bin-dir(默认是 /opt/cni/bin)下。如果有多个CNI配置文件,kubelet将使用按文件名字典顺序排列的第一个作为配置文件。
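例如,使用CNI时kubelet的相关启动参数大致如下(示例;较新版本的kubernetes已改由容器运行时读取CNI配置,这些参数可能不再适用):
kubelet --network-plugin=cni \
        --cni-conf-dir=/etc/cni/net.d \
        --cni-bin-dir=/opt/cni/bin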
CNI规范定义:
- 网络配置文件的格式
- 容器runtime与CNI插件的通信协议
- 基于提供的配置执行网络插件的步骤
- 网络插件调用其他功能插件的步骤
- 插件返回给runtime结果的数据格式
2. CNI配置文件格式
CNI配置文件的格式为JSON格式,配置文件的默认路径:/etc/cni/net.d。插件二进制默认的路径为:/opt/cni/bin。
2.1. 主配置的字段
- cniVersion(string):CNI规范使用的版本,例如0.4.0。
- name(string):目标网络的名称。
- disableCheck(boolean):关闭CHECK操作。
- plugins(list):CNI插件列表及插件配置。
2.2. 插件配置字段
根据不同的插件,插件配置所需的字段不同。
必选字段:
type(string):节点上插件二进制的名称,比如bridge,sriov,macvlan等。
可选字段:
- capabilities(dictionary)
- ipMasq(boolean):为目标网络配上Outbound Masquerade(地址伪装),即由容器内部通过网关向外发送数据包时,对数据包的源IP地址进行修改。当容器以宿主机作为网关时,这个参数是必须要设置的,否则从容器内部发出的数据包无法通过网关路由到其他网段:因为容器内部的IP地址无法被目标网段识别,这些数据包最终会被丢弃。
- ipam(dictionary):IPAM(IP Address Management)即IP地址管理,提供了一系列方法用于对IP和路由进行管理,对应由CNI提供的一组标准IPAM插件,比如host-local、dhcp、static等。比如文中用到的bridge插件,会调用所指定的IPAM插件,实现对网络设备IP地址的分配和管理。如果是自己开发的ipam插件,则相关的入参可以自己定义和实现。以下以host-local为例说明:
  - type:指定所用IPAM插件的名称,在本例中用的是host-local。
  - subnet:为目标网络分配网段,包括网络ID和子网掩码,以CIDR形式标记。在本例中为10.15.10.0/24,即目标网段为10.15.10.0,子网掩码为255.255.255.0。
  - routes:用于指定路由规则,插件会在容器的路由表里生成相应的规则。其中dst表示希望到达的目标网段,以CIDR形式标记;gw为网关的IP地址,即到达目标网段所要经过的"next hop(下一跳)",省略gw时插件会自动选择默认网关。在本例中gw选择的是默认网关,而dst为0.0.0.0/0则代表"任何网络",表示数据包将通过默认网关发往任何网络,这实际上对应一条默认路由规则:当所有其他路由规则都不匹配时,将选择该路由。
  - rangeStart:允许分配的IP地址范围的起始值。
  - rangeEnd:允许分配的IP地址范围的结束值。
  - gateway:为网关(也就是将要在宿主机上创建的bridge)指定的IP地址。如果省略,插件会自动从允许分配的IP地址范围内选择起始值作为网关的IP地址。
- dns(dictionary, optional):dns配置
  - nameservers(list of strings, optional)
  - domain(string, optional)
  - search(list of strings, optional)
  - options(list of strings, optional)
2.3. 配置文件示例
$ mkdir -p /etc/cni/net.d
$ cat >/etc/cni/net.d/10-mynet.conf <<EOF
{
"cniVersion": "1.0.0",
"name": "dbnet",
"plugins": [
{
"type": "bridge",
// plugin specific parameters
"bridge": "cni0",
"keyA": ["some more", "plugin specific", "configuration"],
"ipam": {
"type": "host-local",
// ipam specific
"subnet": "10.1.0.0/16",
"gateway": "10.1.0.1",
"routes": [
{"dst": "0.0.0.0/0"}
]
},
"dns": {
"nameservers": [ "10.1.0.1" ]
}
},
{
"type": "tuning",
"capabilities": {
"mac": true
},
"sysctl": {
"net.core.somaxconn": "500"
}
},
{
"type": "portmap",
"capabilities": {"portMappings": true}
}
]
}
EOF
3. CNI插件
3.1. 安装插件
安装CNI二进制插件,插件下载地:https://github.com/containernetworking/plugins/releases
# 下载二进制
wget https://github.com/containernetworking/plugins/releases/download/v1.1.0/cni-plugins-linux-amd64-v1.1.0.tgz
# 解压文件
tar -zvxf cni-plugins-linux-amd64-v1.1.0.tgz -C /opt/cni/bin/
# 查看解压文件
# ll -h
总用量 63M
-rwxr-xr-x 1 root root 3.7M 2月 24 01:01 bandwidth
-rwxr-xr-x 1 root root 4.1M 2月 24 01:01 bridge
-rwxr-xr-x 1 root root 9.3M 2月 24 01:01 dhcp
-rwxr-xr-x 1 root root 4.2M 2月 24 01:01 firewall
-rwxr-xr-x 1 root root 3.7M 2月 24 01:01 host-device
-rwxr-xr-x 1 root root 3.1M 2月 24 01:01 host-local
-rwxr-xr-x 1 root root 3.8M 2月 24 01:01 ipvlan
-rwxr-xr-x 1 root root 3.2M 2月 24 01:01 loopback
-rwxr-xr-x 1 root root 3.8M 2月 24 01:01 macvlan
-rwxr-xr-x 1 root root 3.6M 2月 24 01:01 portmap
-rwxr-xr-x 1 root root 4.0M 2月 24 01:01 ptp
-rwxr-xr-x 1 root root 3.4M 2月 24 01:01 sbr
-rwxr-xr-x 1 root root 2.7M 2月 24 01:01 static
-rwxr-xr-x 1 root root 3.3M 2月 24 01:01 tuning
-rwxr-xr-x 1 root root 3.8M 2月 24 01:01 vlan
-rwxr-xr-x 1 root root 3.4M 2月 24 01:01 vrf
3.2. 插件分类
参考:https://www.cni.dev/plugins/current/
| 分类 | 插件 | 说明 |
|---|---|---|
| main | bridge | Creates a bridge, adds the host and the container to it |
| ipvlan | Adds an ipvlan interface in the container | |
| macvlan | Creates a new MAC address, forwards all traffic to that to the container | |
| ptp | Creates a veth pair | |
| host-device | Moves an already-existing device into a container | |
| vlan | Creates a vlan interface off a master | |
| IPAM | dhcp | Runs a daemon on the host to make DHCP requests on behalf of a container |
| host-local | Maintains a local database of allocated IPs | |
| static | Allocates static IPv4/IPv6 addresses to containers | |
| meta | tuning | Changes sysctl parameters of an existing interface |
| portmap | An iptables-based portmapping plugin. Maps ports from the host’s address space to the container | |
| bandwidth | Allows bandwidth-limiting through use of traffic control tbf (ingress/egress) | |
| sbr | A plugin that configures source based routing for an interface (from which it is chained) | |
| firewall | A firewall plugin which uses iptables or firewalld to add rules to allow traffic to/from the container |
4. CNI插件接口
具体可参考:https://github.com/containernetworking/cni/blob/master/SPEC.md#cni-operations
CNI定义的接口操作有:
- ADD:添加容器网络,在容器启动时调用。
- DEL:删除容器网络,在容器删除时调用。
- CHECK:检查容器网络是否正常。
- VERSION:显示插件版本。
这些操作通过CNI_COMMAND环境变量来传递给CNI插件二进制。
其中环境变量包括:
- CNI_COMMAND:命令操作,包括ADD、DEL、CHECK或VERSION。
- CNI_CONTAINERID:容器的ID,由runtime分配,不能为空。
- CNI_NETNS:容器的网络命名空间路径,例如:/run/netns/[nsname]。
- CNI_IFNAME:容器内的网卡名称。
- CNI_ARGS:其他参数。
- CNI_PATH:CNI插件二进制的路径。
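借助这些环境变量,可以在节点上以root身份手工调用CNI插件来验证配置,下面是一个简化的示例(网络命名空间名称、网段等均为假设值,配置通过标准输入传给插件):
# 先创建一个测试用的网络命名空间
ip netns add testns
# 以ADD操作调用bridge插件,为该命名空间创建并配置网卡
CNI_COMMAND=ADD \
CNI_CONTAINERID=testid \
CNI_NETNS=/var/run/netns/testns \
CNI_IFNAME=eth0 \
CNI_PATH=/opt/cni/bin \
/opt/cni/bin/bridge <<'EOF'
{
  "cniVersion": "0.4.0",
  "name": "testnet",
  "type": "bridge",
  "bridge": "cni-test0",
  "ipam": {
    "type": "host-local",
    "subnet": "10.22.0.0/16"
  }
}
EOF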
4.1. ADD接口:添加容器网络
在容器的网络命名空间CNI_NETNS中创建CNI_IFNAME网卡设备,或者调整网卡配置。
必选参数:
CNI_COMMANDCNI_CONTAINERIDCNI_NETNSCNI_IFNAME
可选参数:
CNI_ARGSCNI_PATH
4.2. DEL接口:删除容器网络
删除容器网络命名空间CNI_NETNS中的容器网卡CNI_IFNAME,或者撤销ADD修改操作。
必选参数:
CNI_COMMANDCNI_CONTAINERIDCNI_IFNAME
可选参数:
CNI_NETNSCNI_ARGSCNI_PATH
4.3. CHECK接口:检查容器网络
4.4. VERSION接口:输出CNI的版本
参考:
- https://www.cni.dev/docs/spec/
- https://github.com/containernetworking/cni
- https://github.com/containernetworking/cni/blob/spec-v0.4.0/SPEC.md
- https://kubernetes.io/zh/docs/concepts/extend-kubernetes/compute-storage-net/network-plugins/
- https://github.com/containernetworking/plugins/tree/master/plugins
- https://www.cni.dev/plugins/current/
- https://cloud.tencent.com/developer/news/600713
- 配置CNI插件
5.3.2 - Macvlan介绍
1. 简介
macvlan可以看做是物理接口eth(父接口)的子接口,每个macvlan都拥有独立的mac地址,可以被绑定IP作为正常的网卡接口使用。通过这个特性,可以实现在一个物理网络设备绑定多个IP,每个IP拥有独立的mac地址。该特性经常被应用在容器虚拟化中(容器可以配置macvlan的网络,将macvlan interface移动到容器的namespace中)。
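在Linux上可以直接用iproute2命令体验macvlan(以下假设父接口为eth0,IP段仅为示例):
# 基于父接口eth0创建一个bridge模式的macvlan子接口
ip link add link eth0 name macvlan0 type macvlan mode bridge
# 为子接口配置独立IP并启用
ip addr add 192.168.1.100/24 dev macvlan0
ip link set macvlan0 up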
示意图:
2. 四种工作模式
2.1. VEPA (Virtual Ethernet Port Aggregator)
VEPA为默认的工作模式,该模式下,所有macvlan发出的流量都会经过父接口,不管目的地是否与该macvlan共用一个父接口。
2.2. Bridge mode
该bridge模式类似于传统的网桥模式,拥有相同父接口的macvlan可以直接进行通信,不需要将数据发给父接口转发。该模式下不需要交换机支持hairpin模式,性能比VEPA模式好。另外相对于传统的网桥模式,该模式不需要学习mac地址,不需要STP,使得其性能比传统的网桥性能好得多。但是如果父接口down掉,则所有子接口也会down,同时无法通信。
2.3. Private mode
该模式是VEPA模式的增强版(隔离更严格):即使外部交换机支持hairpin回流,同一父接口下的macvlan子接口之间的流量也会被丢弃,子接口之间完全无法互相通信,只能与外部网络通信。
2.4. Passthru mode
该模式下一个父接口只能绑定一个macvlan子接口,父接口被直接"透传"给该子接口使用,容器或虚拟机可以直接修改MAC地址等网卡参数,常用于需要独占物理网卡的场景。
参考:
- https://backreference.org/2014/03/20/some-notes-on-macvlanmacvtap/
- https://github.com/containernetworking/plugins/tree/master/plugins/main/macvlan
- https://github.com/containernetworking/plugins/blob/master/plugins/main/macvlan/macvlan.go
- http://wikibon.org/wiki/v/Edge_Virtual_Bridging
- Linux 虚拟网卡技术:Macvlan
- http://hicu.be/bridge-vs-macvlan
- http://hicu.be/docker-networking-macvlan-bridge-mode-configuration
5.4 - 网络插件
5.4.1 - Flannel介绍
1. flannel是什么(what)
1.1. 概述
Flannel是CoreOS团队针对Kubernetes设计的一个网络规划服务,简单来说,它的功能是让集群中的不同节点主机创建的Docker容器都具有全集群唯一的虚拟IP地址。 Flannel官网:https://github.com/coreos/flannel
1.2. 补充知识点
1、覆盖网络[overlay network]
运行在一个网络之上的网络(应用层网络),它并不依靠IP地址来传递消息,而是通过把IP地址与identifiers做映射来实现资源定位。
2、路由
互联网是由路由器连接的网络组合而成,路由器按照路由表、路由协议等机制实现对数据包正确地转发,从而到达目标主机。路由器根据数据包中目标主机的IP地址和路由控制表比较得出下一个接收数据的路由器。
1)静态路由:事先设置好路由器和主机中的路由表信息。

2)动态路由:让路由协议在运行中自动修改并设置路由表信息。

2. 为什么使用flannel(why)
在默认的Docker配置中,每个节点上的Docker服务会分别负责所在节点容器的IP分配。这样导致的一个问题是,不同节点上容器可能获得相同的内外IP地址。
Flannel的设计目的就是为集群中的所有节点重新规划IP地址的使用规则,从而使得不同节点上的容器能够获得“同属一个内网”且”不重复的”IP地址,并让属于不同节点上的容器能够直接通过内网IP通信。
3. 如何实现flannel(how)
Flannel实质上是一种“覆盖网络(overlay network)”,也就是将TCP数据包装在另一种网络包里面进行路由转发和通信,目前已经支持UDP、VxLAN、AWS VPC和GCE路由等数据转发方式,默认的节点间数据通信方式是UDP转发。
3.1. flannel原理图

- 数据从源容器中发出后,经由所在主机的docker0虚拟网卡转发到flannel0虚拟网卡,这是个P2P的虚拟网卡,flanneld服务监听在网卡的另外一端。
- Flannel通过Etcd服务维护了一张节点间的路由表。
- 源主机的flanneld服务将原始数据内容进行UDP封装后,根据自己的路由表投递给目的节点的flanneld服务;数据到达目的节点后被解包,直接进入目的节点的flannel0虚拟网卡,再被转发到目的主机的docker0虚拟网卡,最后就像本机容器间通信一样,由docker0路由到达目标容器。
3.2. 实现说明
1、UDP封装
原始数据是在起始节点的Flannel服务上进行UDP封装的,投递到目的节点后就被另一端的Flannel服务还原成了原始的数据包,两边的Docker服务都感觉不到这个过程的存在。 UDP的数据内容部分其实是另一个ICMP(也就是ping命令)的数据包。

2、为docker分配不同的IP段
Flannel通过Etcd分配了每个节点可用的IP地址段后,偷偷的修改了Docker的启动参数。

注意其中的“--bip=172.17.18.1/24”这个参数,它限制了所在节点容器获得的IP范围。
这个IP范围是由Flannel自动分配的,由Flannel通过保存在Etcd服务中的记录确保它们不会重复。
3、路由规则
1)数据发送节点的路由表

2)数据接收节点的路由表

例如现在有一个数据包要从IP为172.17.18.2的容器发到IP为172.17.46.2的容器。根据数据发送节点的路由表,它只与172.17.0.0/16这条记录匹配,因此数据从docker0出来以后就被投递到了flannel0。同理在目标节点,由于投递的地址是一个容器,因此目的地址一定会落在docker0对应的172.17.46.0/24这条记录上,自然地被投递到docker0网卡。
3.3. flannel的安装与配置
1、安装
wget http://<官网>/flannel/flannel-0.2.0-10.el7.x86_64.rpm
yum localinstall -y flannel-0.2.0-10.el7.x86_64.rpm
2、配置
vi /etc/sysconfig/flanneld
# Flanneld configuration options
# etcd url location. Point this to the server where etcd runs
FLANNEL_ETCD="http://127.0.0.1:4001"
# etcd config key. This is the configuration key that flannel queries
# For address range assignment
FLANNEL_ETCD_KEY="/xxx/flannel/product/network"
# Any additional options that you want to pass
FLANNEL_OPTIONS=" -iface=eth0"
3、初始化flannel的etcd配置
etcdctl set /xxx/flannel/network/config '{
"Network": "10.0.0.0/16",
"Backend": {
"Type": "vxlan"
}
}'
6 - 容器存储
6.1 - 存储卷概念
6.1.1 - Volume
1. volume概述
- 容器上的文件生命周期同容器的生命周期一致,即容器挂掉之后,容器将会以最初镜像中的文件系统内容启动,之前容器运行时产生的文件将会丢失。
- Pod的volume的生命周期同Pod的生命周期一致,当Pod被删除的时候,对应的volume才会被删除。即Pod中的容器重启时,之前的文件仍可以保存。
容器中的进程看到的是由其 Docker 镜像和卷组成的文件系统视图。
Pod volume的使用方式
Pod 中的每个容器都必须独立指定每个卷的挂载位置,需要给Pod配置volume相关参数。
Pod的volume关键字段如下:
- spec.volumes:提供怎样的数据卷
- spec.containers.volumeMounts:挂载到容器的什么路径
2. volume类型
2.1. emptyDir
1、特点
- 会创建
emptyDir对应的目录,默认为空(如果该目录原来有文件也会被重置为空) - Pod中的不同容器可以在目录中读写相同文件(即Pod中的不同容器可以通过该方式来共享文件)
- 当Pod被删除,
emptyDir中的数据将被永久删除,如果只是Pod挂掉该数据还会保留
2、使用场景
- 不同容器之间共享文件(例如日志采集等)
- 暂存空间,例如用于基于磁盘的合并排序
- 用作长时间计算崩溃恢复时的检查点
3、示例
apiVersion: v1
kind: Pod
metadata:
name: test-pd
spec:
containers:
- image: k8s.gcr.io/test-webserver
name: test-container
volumeMounts:
- mountPath: /cache
name: cache-volume
volumes:
- name: cache-volume
emptyDir: {}
2.2. hostPath
1、特点
- 会将宿主机的目录或文件挂载到Pod中
2、使用场景
-
运行需要访问 Docker 内部的容器;使用
/var/lib/docker的hostPath -
在容器中运行 cAdvisor;使用
/dev/cgroups的hostPath -
其他使用到宿主机文件的场景
hostPath的type字段
| 值 | 行为 |
|---|---|
| 空字符串(默认) | 用于向后兼容,这意味着在挂载 hostPath 卷之前不会执行任何检查。 |
| DirectoryOrCreate | 如果在给定的路径上没有任何东西存在,那么将根据需要在那里创建一个空目录,权限设置为 0755,与 Kubelet 具有相同的组和所有权。 |
| Directory | 给定的路径下必须存在目录 |
| FileOrCreate | 如果在给定的路径上没有任何东西存在,那么会根据需要创建一个空文件,权限设置为 0644,与 Kubelet 具有相同的组和所有权。 |
| File | 给定的路径下必须存在文件 |
| Socket | 给定的路径下必须存在 UNIX 套接字 |
| CharDevice | 给定的路径下必须存在字符设备 |
| BlockDevice | 给定的路径下必须存在块设备 |
注意事项
- 由于每个节点上的文件都不同,具有相同配置的 pod 在不同节点上的行为可能会有所不同
- 当 Kubernetes 按照计划添加资源感知调度时,将无法考虑hostPath使用的资源
- 在底层主机上创建的文件或目录只能由 root 写入。您需要在特权容器中以 root 身份运行进程,或修改主机上的文件权限以便写入hostPath卷
3、示例
apiVersion: v1
kind: Pod
metadata:
name: test-pd
spec:
containers:
- image: k8s.gcr.io/test-webserver
name: test-container
volumeMounts:
- mountPath: /test-pd
name: test-volume
volumes:
- name: test-volume
hostPath:
# directory location on host
path: /data
# this field is optional
type: Directory
2.3. configMap
configMap提供了一种给Pod注入配置文件的方式,配置文件内容存储在configMap对象中,如果Pod使用configMap作为volume的类型,需要先创建configMap的对象。
示例
apiVersion: v1
kind: Pod
metadata:
name: configmap-pod
spec:
containers:
- name: test
image: busybox
volumeMounts:
- name: config-vol
mountPath: /etc/config
volumes:
- name: config-vol
configMap:
name: log-config
items:
- key: log_level
path: log_level
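上例引用的log-config这个configMap可以先行创建,例如(键值仅为示例):
kubectl create configmap log-config --from-literal=log_level=DEBUG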
2.4. cephfs
cephfs的方式将Pod的存储挂载到ceph集群中,通过外部存储的方式持久化Pod的数据(即当Pod被删除数据仍可以存储在ceph集群中),前提是先部署和维护好一个ceph集群。
示例
apiVersion: v1
kind: Pod
metadata:
name: cephfs
spec:
containers:
- name: cephfs-rw
image: kubernetes/pause
volumeMounts:
- mountPath: "/mnt/cephfs"
name: cephfs
volumes:
- name: cephfs
cephfs:
monitors:
- 10.16.154.78:6789
- 10.16.154.82:6789
- 10.16.154.83:6789
# by default the path is /, but you can override and mount a specific path of the filesystem by using the path attribute
# path: /some/path/in/side/cephfs
user: admin
secretFile: "/etc/ceph/admin.secret"
readOnly: true
更多可参考 CephFS 示例。
2.5. nfs
nfs的方式类似cephfs,即将Pod数据存储到NFS集群中,具体可参考NFS示例。
2.6. persistentVolumeClaim
persistentVolumeClaim 卷用于将PersistentVolume挂载到容器中。PersistentVolumes 是在用户不知道特定云环境的细节的情况下“声明”持久化存储(例如 GCE PersistentDisk 或 iSCSI 卷)的一种方式。
参考文章:
6.1.2 - PersistentVolume
1. PV概述
PersistentVolume(简称PV) 是 Volume 之类的卷插件,也是集群中的资源,但独立于Pod的生命周期(即不会因Pod删除而被删除),不归属于某个Namespace。
2. PV和PVC的生命周期
2.1. 配置(Provision)
有两种方式来配置 PV:静态或动态。
1、静态
手动创建PV,可供k8s集群中的对象消费。
2、动态
可以通过StorageClass和具体的Provisioner(例如nfs-client-provisioner)来动态地创建和删除PV。
2.2. 绑定
在动态配置的情况下,用户创建了特定的PVC,k8s会监听新的PVC,并寻找匹配的PV绑定。一旦绑定后,这种绑定是排他性的,PVC和PV的绑定是一对一的映射。
2.3. 使用
Pod 使用PVC作为卷。集群检查PVC以查找绑定的卷并为集群挂载该卷。用户通过在 Pod 的 volume 配置中包含 persistentVolumeClaim 来调度 Pod 并访问用户声明的 PV。
2.4. 回收
PV的回收策略决定了PVC释放后如何处理对应的Volume,目前有 Retain(保留)、Recycle(回收)和 Delete(删除)三种策略。
1、保留(Retain)
保留策略允许手动回收资源,当删除PVC的时候,PV仍然存在,可以通过以下步骤回收卷:
- 删除PV
- 手动清理外部存储的数据资源
- 手动删除或重新使用关联的存储资产
2、回收(Recycle)
该策略已废弃,推荐使用dynamic provisioning
回收策略会在volume上执行基本擦除(rm -rf /thevolume/*),之后卷可被再次声明使用。
3、删除(Delete)
删除策略,当发生删除操作的时候,会从k8s集群中删除PV对象,并执行外部存储资源的删除操作(根据不同的provisioner定义的删除逻辑不同,有的是重命名)。
动态配置的卷继承其StorageClass的回收策略,默认为Delete,即当用户删除PVC的时候,会自动执行PV的删除策略。
如果要修改PV的回收策略,可执行以下命令:
# Get pv
kubectl get pv
# Change policy to Retain
kubectl patch pv <pv_name> -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'
3. PV的类型
PersistentVolume 类型以插件形式实现。以下仅列部分常用类型:
- GCEPersistentDisk
- AWSElasticBlockStore
- NFS
- RBD (Ceph Block Device)
- CephFS
- Glusterfs
4. PV的属性
每个 PV 配置中都包含一个 spec 规格字段和一个 status 卷状态字段。
apiVersion: v1
kind: PersistentVolume
metadata:
annotations:
pv.kubernetes.io/provisioned-by: fuseim.pri/ifs
creationTimestamp: 2018-07-12T06:46:48Z
name: default-test-web-0-pvc-58cf5ec1-859f-11e8-bb61-005056b83985
resourceVersion: "100163256"
selfLink: /api/v1/persistentvolumes/default-test-web-0-pvc-58cf5ec1-859f-11e8-bb61-005056b83985
uid: 59796ba3-859f-11e8-9c50-c81f66bcff65
spec:
accessModes:
- ReadWriteOnce
capacity:
storage: 2Gi
volumeMode: Filesystem
claimRef:
apiVersion: v1
kind: PersistentVolumeClaim
name: test-web-0
namespace: default
resourceVersion: "100163248"
uid: 58cf5ec1-859f-11e8-bb61-005056b83985
nfs:
path: /data/nfs-storage/default-test-web-0-pvc-58cf5ec1-859f-11e8-bb61-005056b83985
server: 172.16.201.54
persistentVolumeReclaimPolicy: Delete
storageClassName: managed-nfs-storage
mountOptions:
- hard
- nfsvers=4.1
status:
phase: Bound
4.1. Capacity
给PV设置特定的存储容量,更多 capacity 可参考Kubernetes 资源模型 。
4.2. Volume Mode
volumeMode 的有效值可以是Filesystem或Block。如果未指定,volumeMode 将默认为Filesystem。
4.3. Access Modes
访问模式包括:
- ReadWriteOnce——该卷可以被单个节点以读/写模式挂载
- ReadOnlyMany——该卷可以被多个节点以只读模式挂载
- ReadWriteMany——该卷可以被多个节点以读/写模式挂载
在命令行中,访问模式缩写为:
- RWO - ReadWriteOnce
- ROX - ReadOnlyMany
- RWX - ReadWriteMany
一个卷一次只能使用一种访问模式挂载,即使它支持很多访问模式。
以下只列举部分常用插件:
| Volume 插件 | ReadWriteOnce | ReadOnlyMany | ReadWriteMany |
|---|---|---|---|
| AWSElasticBlockStore | ✓ | - | - |
| CephFS | ✓ | ✓ | ✓ |
| GCEPersistentDisk | ✓ | ✓ | - |
| Glusterfs | ✓ | ✓ | ✓ |
| HostPath | ✓ | - | - |
| NFS | ✓ | ✓ | ✓ |
| RBD | ✓ | ✓ | - |
| ... | - |
4.4. Class
PV可以指定一个StorageClass来动态绑定PV和PVC,其中通过 storageClassName 属性来指定具体的StorageClass,如果没有指定该属性的PV,它只能绑定到不需要特定类的 PVC。
4.5. Reclaim Policy
回收策略包括:
- Retain(保留)——手动回收
- Recycle(回收)——基本擦除(rm -rf /thevolume/*)
- Delete(删除)——关联的存储资产(例如 AWS EBS、GCE PD、Azure Disk 和 OpenStack Cinder 卷)将被删除
当前,只有 NFS 和 HostPath 支持回收策略。AWS EBS、GCE PD、Azure Disk 和 Cinder 卷支持删除策略。
4.6. Mount Options
Kubernetes 管理员可以指定在节点上为挂载持久卷指定挂载选项。
注意:不是所有的持久化卷类型都支持挂载选项。
支持挂载选项常用的类型有:
- GCEPersistentDisk
- AWSElasticBlockStore
- AzureFile
- AzureDisk
- NFS
- RBD (Ceph Block Device)
- CephFS
- Cinder (OpenStack 卷存储)
- Glusterfs
4.7. Phase
PV可以处于以下的某种状态:
- Available(可用)——一块空闲资源还没有被任何声明绑定
- Bound(已绑定)——卷已经被声明绑定
- Released(已释放)——声明被删除,但是资源还未被集群重新声明
- Failed(失败)——该卷的自动回收失败
命令行会显示绑定到 PV 的 PVC 的名称。
参考文章:
6.1.3 - PersistentVolumeClaim
1. PVC概述
PersistentVolumeClaim(简称PVC)是用户存储的请求,PVC消耗PV的资源,可以请求特定的大小和访问模式,需要指定归属于某个Namespace,在同一个Namespace的Pod才可以指定对应的PVC。
当需要不同性质的PV来满足存储需求时,可以使用StorageClass来实现。
每个 PVC 中都包含一个 spec 规格字段和一个 status 声明状态字段。
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
name: myclaim
spec:
accessModes:
- ReadWriteOnce
volumeMode: Filesystem
resources:
requests:
storage: 8Gi
storageClassName: slow
selector:
matchLabels:
release: "stable"
matchExpressions:
- {key: environment, operator: In, values: [dev]}
2. PVC的属性
2.1. accessModes
对应存储的访问模式,例如:ReadWriteOnce。
2.2. volumeMode
对应存储的数据卷模式,例如:Filesystem。
2.3. resources
声明可以请求特定数量的资源。相同的资源模型适用于Volume和PVC。
2.4. selector
声明label selector,只有标签与选择器匹配的卷可以绑定到声明。
- matchLabels:volume 必须有具有该值的标签
- matchExpressions:条件列表,通过条件表达式筛选匹配的卷。有效的运算符包括 In、NotIn、Exists 和 DoesNotExist。
2.5. storageClassName
通过storageClassName参数来指定使用对应名字的StorageClass,只有所请求的类与 PVC 具有相同 storageClassName 的 PV 才能绑定到 PVC。
PVC可以不指定storageClassName,或者将该值设置为空,如果打开了准入控制插件,并且指定一个默认的 StorageClass,则PVC会使用默认的StorageClass,否则就绑定到没有StorageClass的 PV上。
之前使用注解
volume.beta.kubernetes.io/storage-class而不是storageClassName属性。这个注解仍然有效,但是在未来的 Kubernetes 版本中不会支持。
3. 将PVC作为Volume
将PVC作为Pod的Volume,PVC与Pod需要在同一个命名空间下,其实Pod的声明如下:
kind: Pod
apiVersion: v1
metadata:
name: mypod
spec:
containers:
- name: myfrontend
image: dockerfile/nginx
volumeMounts:
- mountPath: "/var/www/html"
name: mypd
volumes:
- name: mypd
persistentVolumeClaim: # 使用PVC
claimName: myclaim
PersistentVolumes 绑定是唯一的,并且由于 PersistentVolumeClaims 是命名空间对象,因此只能在一个命名空间内挂载具有“多个”模式(ROX、RWX)的PVC。
参考文章:
6.1.4 - StorageClass
1. StorageClass概述
StorageClass提供了一种描述存储类(class)的方法,不同的class可能会映射到不同的服务质量等级和备份策略或其他策略等。
StorageClass 对象中包含 provisioner、parameters 和 reclaimPolicy 字段,当需要动态分配 PersistentVolume 时会使用到。当创建 StorageClass 对象时,设置名称和其他参数,一旦创建了对象就不能再对其更新。也可以为没有申请绑定到特定 class 的 PVC 指定一个默认的 StorageClass 。
StorageClass对象文件
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
name: standard
provisioner: kubernetes.io/aws-ebs
parameters:
type: gp2
reclaimPolicy: Retain
mountOptions:
- debug
2. StorageClass的属性
2.1. Provisioner(存储分配器)
Storage class 有一个分配器(provisioner),用来决定使用哪个卷插件分配 PV,该字段必须指定。可以指定内部分配器,也可以指定外部分配器。外部分配器的代码地址为: kubernetes-incubator/external-storage,其中包括NFS和Ceph等。
2.2. Reclaim Policy(回收策略)
可以通过reclaimPolicy字段指定创建的Persistent Volume的回收策略,回收策略包括:Delete 或者 Retain,没有指定默认为Delete。
2.3. Mount Options(挂载选项)
由 storage class 动态创建的 Persistent Volume 将使用 class 中 mountOptions 字段指定的挂载选项。
2.4. 参数
Storage class 具有描述属于 storage class 卷的参数。取决于分配器,可以接受不同的参数。 当参数被省略时,会使用默认值。
例如以下使用Ceph RBD
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
name: fast
provisioner: kubernetes.io/rbd
parameters:
monitors: 10.16.153.105:6789
adminId: kube
adminSecretName: ceph-secret
adminSecretNamespace: kube-system
pool: kube
userId: kube
userSecretName: ceph-secret-user
fsType: ext4
imageFormat: "2"
imageFeatures: "layering"
对应的参数说明
- monitors:Ceph monitor,逗号分隔。该参数是必需的。
- adminId:Ceph 客户端 ID,用于在池(ceph pool)中创建映像。默认是 "admin"。
- adminSecretNamespace:adminSecret 的 namespace。默认是 "default"。
- adminSecret:adminId 的 Secret 名称。该参数是必需的。提供的 secret 必须有值为 "kubernetes.io/rbd" 的 type 参数。
- pool:Ceph RBD 池。默认是 "rbd"。
- userId:Ceph 客户端 ID,用于映射 RBD 镜像(RBD image)。默认与 adminId 相同。
- userSecretName:用于映射 RBD 镜像的 userId 的 Ceph Secret 的名字。它必须与 PVC 存在于相同的 namespace 中。该参数是必需的。提供的 secret 必须具有值为 "kubernetes.io/rbd" 的 type 参数,例如以这样的方式创建:kubectl create secret generic ceph-secret --type="kubernetes.io/rbd" --from-literal=key='QVFEQ1pMdFhPUnQrSmhBQUFYaERWNHJsZ3BsMmNjcDR6RFZST0E9PQ==' --namespace=kube-system
- fsType:Kubernetes 支持的 fsType。默认:"ext4"。
- imageFormat:Ceph RBD 镜像格式,"1" 或者 "2"。默认值是 "1"。
- imageFeatures:这个参数是可选的,只有将 imageFormat 设置为 "2" 时才能使用。目前支持的功能只有 layering。默认为空,即不开启任何功能。
参考文章:
6.1.5 - Dynamic Volume Provisioning
Dynamic volume provisioning允许用户按需自动创建存储卷,这种方式可以让用户不需要关心存储的复杂性和差别,又可以选择不同的存储类型。
1. 开启Dynamic Provisioning
需要先提前创建StorageClass对象,StorageClass中定义了使用哪个provisioner,并且在provisioner被调用时传入哪些参数,具体可参考StorageClass介绍。
例如:
- 磁盘类存储
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: slow
provisioner: kubernetes.io/gce-pd
parameters:
type: pd-standard
- SSD类存储
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: fast
provisioner: kubernetes.io/gce-pd
parameters:
type: pd-ssd
2. 使用Dynamic Provisioning
创建一个PVC对象,并且在其中storageClassName字段指明需要用到的StorageClass的名称,例如:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: claim1
spec:
accessModes:
- ReadWriteOnce
storageClassName: fast
resources:
requests:
storage: 30Gi
当使用到PVC的时候会自动创建对应的外部存储,当PVC被删除的时候,会自动销毁(或备份)外部存储。
3. 默认的StorageClass
当没有对应的StorageClass配置时,可以设定默认的StorageClass,需要执行以下操作:
- 在API Server开启DefaultStorageClass admission controller。
- 设置默认的StorageClass对象。
可以通过添加storageclass.kubernetes.io/is-default-class注解的方式设置某个StorageClass为默认的StorageClass。当用户创建了一个PersistentVolumeClaim,但没有指定storageClassName的时候,会自动将该PVC的storageClassName指向默认的StorageClass。
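例如,可以通过下面的命令为已有的StorageClass添加或去掉该注解(StorageClass名称fast为假设值):
# 将fast设置为默认StorageClass
kubectl patch storageclass fast -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
# 取消默认
kubectl patch storageclass fast -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"false"}}}'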
参考文章:
6.2 - CSI
6.2.1 - csi-cephfs-plugin
1. 编译CSI CephFS plugin
CSI CephFS plugin用来提供CephFS存储卷和挂载存储卷,源码参考:https://github.com/ceph/ceph-csi 。
1.1. 编译二进制
$ make cephfsplugin
1.2. 编译Docker镜像
$ make image-cephfsplugin
2. 配置项
2.1. 命令行参数
| Option | Default value | Description |
|---|---|---|
--endpoint |
unix://tmp/csi.sock |
CSI endpoint, must be a UNIX socket |
--drivername |
csi-cephfsplugin |
name of the driver (Kubernetes: provisioner field in StorageClass must correspond to this value) |
--nodeid |
empty | This node’s ID |
--volumemounter |
empty | default volume mounter. Available options are kernel and fuse. This is the mount method used if volume parameters don’t specify otherwise. If left unspecified, the driver will first probe for ceph-fuse in system’s path and will choose Ceph kernel client if probing failed. |
2.2. volume参数
| Parameter | Required | Description |
|---|---|---|
monitors |
yes | Comma separated list of Ceph monitors (e.g. 192.168.100.1:6789,192.168.100.2:6789,192.168.100.3:6789) |
mounter |
no | Mount method to be used for this volume. Available options are kernel for Ceph kernel client and fuse for Ceph FUSE driver. Defaults to “default mounter”, see command line arguments. |
provisionVolume |
yes | Mode of operation. BOOL value. If true, a new CephFS volume will be provisioned. If false, an existing CephFS will be used. |
pool |
for provisionVolume=true |
Ceph pool into which the volume shall be created |
rootPath |
for provisionVolume=false |
Root path of an existing CephFS volume |
csiProvisionerSecretName, csiNodeStageSecretName |
for Kubernetes | name of the Kubernetes Secret object containing Ceph client credentials. Both parameters should have the same value |
csiProvisionerSecretNamespace, csiNodeStageSecretNamespace |
for Kubernetes | namespaces of the above Secret objects |
2.3. provisionVolume
2.3.1. 管理员密钥认证
当provisionVolume=true时,必要的管理员认证参数如下:
- adminID:ID of an admin client
- adminKey:key of the admin client
2.3.2. 普通用户密钥认证
当provisionVolume=false时,必要的用户认证参数如下:
- userID:ID of a user client
- userKey:key of a user client
参考文章:
6.2.2 - 部署csi-cephfs
0. 说明
要求Kubernetes的版本在1.11及以上,k8s集群必须允许特权Pod(privileged pods),即apiserver和kubelet需要设置--allow-privileged为true。节点的Docker daemon需要允许挂载共享卷。
涉及镜像
- quay.io/k8scsi/csi-provisioner:v0.3.0
- quay.io/k8scsi/csi-attacher:v0.3.0
- quay.io/k8scsi/driver-registrar:v0.3.0
- quay.io/cephcsi/cephfsplugin:v0.3.0
1. 部署RBAC
部署service accounts, cluster roles 和 cluster role bindings,这些可供RBD和CephFS CSI plugins共同使用,他们拥有相同的权限。
$ kubectl create -f csi-attacher-rbac.yaml
$ kubectl create -f csi-provisioner-rbac.yaml
$ kubectl create -f csi-nodeplugin-rbac.yaml
1.1. csi-attacher-rbac.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
name: csi-attacher
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
name: external-attacher-runner
rules:
- apiGroups: [""]
resources: ["events"]
verbs: ["get", "list", "watch", "update"]
- apiGroups: [""]
resources: ["persistentvolumes"]
verbs: ["get", "list", "watch", "update"]
- apiGroups: [""]
resources: ["nodes"]
verbs: ["get", "list", "watch"]
- apiGroups: ["storage.k8s.io"]
resources: ["volumeattachments"]
verbs: ["get", "list", "watch", "update"]
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
name: csi-attacher-role
subjects:
- kind: ServiceAccount
name: csi-attacher
namespace: default
roleRef:
kind: ClusterRole
name: external-attacher-runner
apiGroup: rbac.authorization.k8s.io
1.2. csi-provisioner-rbac.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
name: csi-provisioner
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
name: external-provisioner-runner
rules:
- apiGroups: [""]
resources: ["secrets"]
verbs: ["get", "list"]
- apiGroups: [""]
resources: ["persistentvolumes"]
verbs: ["get", "list", "watch", "create", "delete"]
- apiGroups: [""]
resources: ["persistentvolumeclaims"]
verbs: ["get", "list", "watch", "update"]
- apiGroups: ["storage.k8s.io"]
resources: ["storageclasses"]
verbs: ["get", "list", "watch"]
- apiGroups: [""]
resources: ["events"]
verbs: ["list", "watch", "create", "update", "patch"]
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
name: csi-provisioner-role
subjects:
- kind: ServiceAccount
name: csi-provisioner
namespace: default
roleRef:
kind: ClusterRole
name: external-provisioner-runner
apiGroup: rbac.authorization.k8s.io
1.3. csi-nodeplugin-rbac.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
name: csi-nodeplugin
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
name: csi-nodeplugin
rules:
- apiGroups: [""]
resources: ["nodes"]
verbs: ["get", "list", "update"]
- apiGroups: [""]
resources: ["namespaces"]
verbs: ["get", "list"]
- apiGroups: [""]
resources: ["persistentvolumes"]
verbs: ["get", "list", "watch", "update"]
- apiGroups: ["storage.k8s.io"]
resources: ["volumeattachments"]
verbs: ["get", "list", "watch", "update"]
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
name: csi-nodeplugin
subjects:
- kind: ServiceAccount
name: csi-nodeplugin
namespace: default
roleRef:
kind: ClusterRole
name: csi-nodeplugin
apiGroup: rbac.authorization.k8s.io
2. 部署CSI sidecar containers
通过StatefulSet的方式部署external-attacher和external-provisioner供CSI CephFS使用。
$ kubectl create -f csi-cephfsplugin-attacher.yaml
$ kubectl create -f csi-cephfsplugin-provisioner.yaml
2.1. csi-cephfsplugin-provisioner.yaml
kind: Service
apiVersion: v1
metadata:
name: csi-cephfsplugin-provisioner
labels:
app: csi-cephfsplugin-provisioner
spec:
selector:
app: csi-cephfsplugin-provisioner
ports:
- name: dummy
port: 12345
---
kind: StatefulSet
apiVersion: apps/v1beta1
metadata:
name: csi-cephfsplugin-provisioner
spec:
serviceName: "csi-cephfsplugin-provisioner"
replicas: 1
template:
metadata:
labels:
app: csi-cephfsplugin-provisioner
spec:
serviceAccount: csi-provisioner
containers:
- name: csi-provisioner
image: quay.io/k8scsi/csi-provisioner:v0.3.0
args:
- "--provisioner=csi-cephfsplugin"
- "--csi-address=$(ADDRESS)"
- "--v=5"
env:
- name: ADDRESS
value: /var/lib/kubelet/plugins/csi-cephfsplugin/csi.sock
imagePullPolicy: "IfNotPresent"
volumeMounts:
- name: socket-dir
mountPath: /var/lib/kubelet/plugins/csi-cephfsplugin
volumes:
- name: socket-dir
hostPath:
path: /var/lib/kubelet/plugins/csi-cephfsplugin
type: DirectoryOrCreate
2.2. csi-cephfsplugin-attacher.yaml
kind: Service
apiVersion: v1
metadata:
name: csi-cephfsplugin-attacher
labels:
app: csi-cephfsplugin-attacher
spec:
selector:
app: csi-cephfsplugin-attacher
ports:
- name: dummy
port: 12345
---
kind: StatefulSet
apiVersion: apps/v1beta1
metadata:
name: csi-cephfsplugin-attacher
spec:
serviceName: "csi-cephfsplugin-attacher"
replicas: 1
template:
metadata:
labels:
app: csi-cephfsplugin-attacher
spec:
serviceAccount: csi-attacher
containers:
- name: csi-cephfsplugin-attacher
image: quay.io/k8scsi/csi-attacher:v0.3.0
args:
- "--v=5"
- "--csi-address=$(ADDRESS)"
env:
- name: ADDRESS
value: /var/lib/kubelet/plugins/csi-cephfsplugin/csi.sock
imagePullPolicy: "IfNotPresent"
volumeMounts:
- name: socket-dir
mountPath: /var/lib/kubelet/plugins/csi-cephfsplugin
volumes:
- name: socket-dir
hostPath:
path: /var/lib/kubelet/plugins/csi-cephfsplugin
type: DirectoryOrCreate
3. 部署CSI-CephFS-driver(plugin)
csi-cephfs-plugin 的作用类似nfs-client,部署在所有node节点上,执行ceph的挂载等相关任务。
通过DaemonSet的方式部署,其中包括两个容器:CSI driver-registrar 和 CSI CephFS driver。
$ kubectl create -f csi-cephfsplugin.yaml
3.1. csi-cephfsplugin.yaml
kind: DaemonSet
apiVersion: apps/v1beta2
metadata:
name: csi-cephfsplugin
spec:
selector:
matchLabels:
app: csi-cephfsplugin
template:
metadata:
labels:
app: csi-cephfsplugin
spec:
serviceAccount: csi-nodeplugin
hostNetwork: true
# to use e.g. Rook orchestrated cluster, and mons' FQDN is
# resolved through k8s service, set dns policy to cluster first
dnsPolicy: ClusterFirstWithHostNet
containers:
- name: driver-registrar
image: quay.io/k8scsi/driver-registrar:v0.3.0
args:
- "--v=5"
- "--csi-address=$(ADDRESS)"
- "--kubelet-registration-path=$(DRIVER_REG_SOCK_PATH)"
env:
- name: ADDRESS
value: /var/lib/kubelet/plugins/csi-cephfsplugin/csi.sock
- name: DRIVER_REG_SOCK_PATH
value: /var/lib/kubelet/plugins/csi-cephfsplugin/csi.sock
- name: KUBE_NODE_NAME
valueFrom:
fieldRef:
fieldPath: spec.nodeName
volumeMounts:
- name: socket-dir
mountPath: /var/lib/kubelet/plugins/csi-cephfsplugin
- name: registration-dir
mountPath: /registration
- name: csi-cephfsplugin
securityContext:
privileged: true
capabilities:
add: ["SYS_ADMIN"]
allowPrivilegeEscalation: true
image: quay.io/cephcsi/cephfsplugin:v0.3.0
args :
- "--nodeid=$(NODE_ID)"
- "--endpoint=$(CSI_ENDPOINT)"
- "--v=5"
- "--drivername=csi-cephfsplugin"
env:
- name: NODE_ID
valueFrom:
fieldRef:
fieldPath: spec.nodeName
- name: CSI_ENDPOINT
value: unix://var/lib/kubelet/plugins/csi-cephfsplugin/csi.sock
imagePullPolicy: "IfNotPresent"
volumeMounts:
- name: plugin-dir
mountPath: /var/lib/kubelet/plugins/csi-cephfsplugin
- name: pods-mount-dir
mountPath: /var/lib/kubelet/pods
mountPropagation: "Bidirectional"
- mountPath: /sys
name: host-sys
- name: lib-modules
mountPath: /lib/modules
readOnly: true
- name: host-dev
mountPath: /dev
volumes:
- name: plugin-dir
hostPath:
path: /var/lib/kubelet/plugins/csi-cephfsplugin
type: DirectoryOrCreate
- name: registration-dir
hostPath:
path: /var/lib/kubelet/plugins/
type: Directory
- name: pods-mount-dir
hostPath:
path: /var/lib/kubelet/pods
type: Directory
- name: socket-dir
hostPath:
path: /var/lib/kubelet/plugins/csi-cephfsplugin
type: DirectoryOrCreate
- name: host-sys
hostPath:
path: /sys
- name: lib-modules
hostPath:
path: /lib/modules
- name: host-dev
hostPath:
path: /dev
4. 确认部署结果
$ kubectl get all
NAME READY STATUS RESTARTS AGE
pod/csi-cephfsplugin-attacher-0 1/1 Running 0 26s
pod/csi-cephfsplugin-provisioner-0 1/1 Running 0 25s
pod/csi-cephfsplugin-rljcv 2/2 Running 0 24s
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/csi-cephfsplugin-attacher ClusterIP 10.104.116.218 <none> 12345/TCP 27s
service/csi-cephfsplugin-provisioner ClusterIP 10.101.78.75 <none> 12345/TCP 26s
...
参考文档:
6.2.3 - 部署cephfs-provisioner
1. 安装cephfs客户端
所有node节点安装cephfs客户端,主要用来和ceph集群挂载使用。
yum install -y ceph-common
2. 部署RBAC
2.1. ClusterRole
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
name: cephfs-provisioner
namespace: cephfs
rules:
- apiGroups: [""]
resources: ["persistentvolumes"]
verbs: ["get", "list", "watch", "create", "delete"]
- apiGroups: [""]
resources: ["persistentvolumeclaims"]
verbs: ["get", "list", "watch", "update"]
- apiGroups: ["storage.k8s.io"]
resources: ["storageclasses"]
verbs: ["get", "list", "watch"]
- apiGroups: [""]
resources: ["events"]
verbs: ["create", "update", "patch"]
- apiGroups: [""]
resources: ["services"]
resourceNames: ["kube-dns","coredns"]
verbs: ["list", "get"]
2.2. ClusterRoleBinding
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
name: cephfs-provisioner
subjects:
- kind: ServiceAccount
name: cephfs-provisioner
namespace: cephfs
roleRef:
kind: ClusterRole
name: cephfs-provisioner
apiGroup: rbac.authorization.k8s.io
2.3. Role
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: cephfs-provisioner
namespace: cephfs
rules:
- apiGroups: [""]
resources: ["secrets"]
verbs: ["create", "get", "delete"]
- apiGroups: [""]
resources: ["endpoints"]
verbs: ["get", "list", "watch", "create", "update", "patch"]
2.4. RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: cephfs-provisioner
namespace: cephfs
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: Role
name: cephfs-provisioner
subjects:
- kind: ServiceAccount
name: cephfs-provisioner
2.5. ServiceAccount
apiVersion: v1
kind: ServiceAccount
metadata:
name: cephfs-provisioner
namespace: cephfs
3. 部署 cephfs-provisioner
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
name: cephfs-provisioner
namespace: cephfs
spec:
replicas: 1
strategy:
type: Recreate
template:
metadata:
labels:
app: cephfs-provisioner
spec:
containers:
- name: cephfs-provisioner
image: "quay.io/external_storage/cephfs-provisioner:latest"
resources:
limits:
cpu: 500m
memory: 512Mi
requests:
cpu: 100m
memory: 64Mi
env:
- name: PROVISIONER_NAME # 与storageclass的provisioner参数相同
value: ceph.com/cephfs
- name: PROVISIONER_SECRET_NAMESPACE # 与rbac的namespace相同
value: cephfs
command:
- "/usr/local/bin/cephfs-provisioner"
args:
- "-id=cephfs-provisioner-1"
serviceAccount: cephfs-provisioner
4. 部署storageclass
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: cephfs-provisioner-sc
provisioner: ceph.com/cephfs
volumeBindingMode: WaitForFirstConsumer
parameters:
monitors: 192.168.27.43:6789,192.168.27.44:6789,192.168.27.45:6789
adminId: admin
adminSecretName: csi-cephfs-secret
adminSecretNamespace: "kube-csi"
claimRoot: /pvc-volumes
5. 部署statefulset
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: cephfs-provisioner-nginx
spec:
serviceName: "nginx"
replicas: 1
selector:
matchLabels:
app: nginx
template:
metadata:
labels:
app: nginx
spec:
containers:
- name: nginx
image: nginx:latest #nginx的镜像
imagePullPolicy: IfNotPresent
volumeMounts:
- mountPath: "/mnt" #容器里面的挂载目录,该目录挂载到NFS的共享目录上
name: test
volumeClaimTemplates:
- metadata:
name: test
spec:
accessModes: [ "ReadWriteOnce" ]
resources:
requests:
storage: 2Gi
storageClassName: cephfs-provisioner-sc
6. 日志
6.1. cephfs-provisoner 执行日志
I0327 07:18:19.742239 1 controller.go:987] provision "default/test-cephfs-ngx-wait-22-0" class "cephfs-provisioner-sc": started
I0327 07:18:19.745239 1 event.go:221] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"default", Name:"test-cephfs-ngx-wait-22-0", UID:"7f6b60d5-5060-11e9-9a9c-c81f66bcff65", APIVersion:"v1", ResourceVersion:"347214256", FieldPath:""}): type: 'Normal' reason: 'Provisioning' External provisioner is provisioning volume for claim "default/test-cephfs-ngx-wait-22-0"
I0327 07:18:23.281277 1 cephfs-provisioner.go:222] successfully created CephFS share &CephFSPersistentVolumeSource{Monitors:[192.168.27.43:6789 192.168.27.44:6789 192.168.27.45:6789],Path:/pvc-volumes/kubernetes/kubernetes-dynamic-pvc-7f7cb62f-5060-11e9-85c0-0adb8ef08100,User:kubernetes-dynamic-user-7f7cb69f-5060-11e9-85c0-0adb8ef08100,SecretFile:,SecretRef:&SecretReference{Name:ceph-kubernetes-dynamic-user-7f7cb69f-5060-11e9-85c0-0adb8ef08100-secret,Namespace:default,},ReadOnly:false,}
I0327 07:18:23.281371 1 controller.go:1087] provision "default/test-cephfs-ngx-wait-22-0" class "cephfs-provisioner-sc": volume "pvc-7f6b60d5-5060-11e9-9a9c-c81f66bcff65" provisioned
I0327 07:18:23.281415 1 controller.go:1101] provision "default/test-cephfs-ngx-wait-22-0" class "cephfs-provisioner-sc": trying to save persistentvvolume "pvc-7f6b60d5-5060-11e9-9a9c-c81f66bcff65"
I0327 07:18:23.284621 1 controller.go:1108] provision "default/test-cephfs-ngx-wait-22-0" class "cephfs-provisioner-sc": persistentvolume "pvc-7f6b60d5-5060-11e9-9a9c-c81f66bcff65" saved
I0327 07:18:23.284723 1 controller.go:1149] provision "default/test-cephfs-ngx-wait-22-0" class "cephfs-provisioner-sc": succeeded
I0327 07:18:23.284810 1 event.go:221] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"default", Name:"test-cephfs-ngx-wait-22-0", UID:"7f6b60d5-5060-11e9-9a9c-c81f66bcff65", APIVersion:"v1", ResourceVersion:"347214256", FieldPath:""}): type: 'Normal' reason: 'ProvisioningSucceeded' Successfully provisioned volume pvc-7f6b60d5-5060-11e9-9a9c-c81f66bcff65
6.2. debug 日志
I0327 08:08:11.789608 1 controller.go:987] provision "default/test-cephfs-ngx-wait-44-0" class "cephfs-sc-wait": started
I0327 08:08:11.793258 1 event.go:221] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"default", Name:"test-cephfs-ngx-wait-44-0", UID:"81846859-5067-11e9-9a9c-c81f66bcff65", APIVersion:"v1", ResourceVersion:"347237916", FieldPath:""}): type: 'Normal' reason: 'Provisioning' External provisioner is provisioning volume for claim "default/test-cephfs-ngx-wait-44-0"
E0327 08:08:12.164705 1 cephfs-provisioner.go:158] failed to provision share "kubernetes-dynamic-pvc-76ecdc5a-5067-11e9-9421-2a1b1be1aeef" for "kubernetes-dynamic-user-76ecdcee-5067-11e9-9421-2a1b1be1aeef", err: exit status 1, output: Traceback (most recent call last):
File "/usr/local/bin/cephfs_provisioner", line 364, in <module>
main()
File "/usr/local/bin/cephfs_provisioner", line 358, in main
print cephfs.create_share(share, user, size=size)
File "/usr/local/bin/cephfs_provisioner", line 228, in create_share
volume = self.volume_client.create_volume(volume_path, size=size, namespace_isolated=not self.ceph_namespace_isolation_disabled)
File "/usr/local/bin/cephfs_provisioner", line 112, in volume_client
self._volume_client.connect(None)
File "/lib/python2.7/site-packages/ceph_volume_client.py", line 458, in connect
self.rados.connect()
File "rados.pyx", line 895, in rados.Rados.connect (/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.1/rpm/el7/BUILD/ceph-13.2.1/build/src/pybind/rados/pyrex/rados.c:9815)
rados.IOError: [errno 5] error connecting to the cluster
W0327 08:08:12.164908 1 controller.go:746] Retrying syncing claim "default/test-cephfs-ngx-wait-44-0" because failures 2 < threshold 15
E0327 08:08:12.164977 1 controller.go:761] error syncing claim "default/test-cephfs-ngx-wait-44-0": failed to provision volume with StorageClass "cephfs-sc-wait": exit status 1
I0327 08:08:12.165974 1 event.go:221] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"default", Name:"test-cephfs-ngx-wait-44-0", UID:"81846859-5067-11e9-9a9c-c81f66bcff65", APIVersion:"v1", ResourceVersion:"347237916", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' failed to provision volume with StorageClass "cephfs-sc-wait": exit status 1
参考
6.2.4 - FlexVolume
1. FlexVolume介绍
Flexvolume提供了一种扩展k8s存储插件的方式,用户可以自定义自己的存储插件。类似的功能的实现还有CSI的方式。Flexvolume在k8s 1.8+以上版本提供GA功能版本。
2. 使用方式
在每个node节点安装存储插件二进制,该二进制实现flexvolume的相关接口,默认存储插件的存放路径为/usr/libexec/kubernetes/kubelet-plugins/volume/exec/<vendor~driver>/<driver>。
其中vendor~driver的名字需要和Pod中flexVolume.driver字段的值匹配,只是driver字段中用/代替了目录名中的~。
例如:
- 插件路径:/usr/libexec/kubernetes/kubelet-plugins/volume/exec/foo~cifs/cifs
- pod中flexVolume.driver:foo/cifs
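以本文第4节的nfs驱动脚本为例,安装过程大致如下(目录与驱动名与上文示例保持一致,需在每个node节点上执行):
# 创建插件目录并放入可执行脚本
mkdir -p /usr/libexec/kubernetes/kubelet-plugins/volume/exec/k8s~nfs
cp nfs /usr/libexec/kubernetes/kubelet-plugins/volume/exec/k8s~nfs/nfs
chmod +x /usr/libexec/kubernetes/kubelet-plugins/volume/exec/k8s~nfs/nfs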
3. FlexVolume接口
节点上的存储插件需要实现以下的接口。
3.1. init
<driver executable> init
3.2. attach
<driver executable> attach <json options> <node name>
3.3. detach
<driver executable> detach <mount device> <node name>
3.4. waitforattach
<driver executable> waitforattach <mount device> <json options>
3.5. isattached
<driver executable> isattached <json options> <node name>
3.6. mountdevice
<driver executable> mountdevice <mount dir> <mount device> <json options>
3.7. unmountdevice
<driver executable> unmountdevice <mount device>
3.8. mount
<driver executable> mount <mount dir> <json options>
3.9. unmount
<driver executable> unmount <mount dir>
3.10. 插件输出
{
"status": "<Success/Failure/Not supported>",
"message": "<Reason for success/failure>",
"device": "<Path to the device attached. This field is valid only for attach & waitforattach call-outs>"
"volumeName": "<Cluster wide unique name of the volume. Valid only for getvolumename call-out>"
"attached": <True/False (Return true if volume is attached on the node. Valid only for isattached call-out)>
"capabilities": <Only included as part of the Init response>
{
"attach": <True/False (Return true if the driver implements attach and detach)>
}
}
4. 示例
4.1. pod的yaml文件内容
nginx-nfs.yaml
相关参数为flexVolume.driver等。
apiVersion: v1
kind: Pod
metadata:
name: nginx-nfs
namespace: default
spec:
containers:
- name: nginx-nfs
image: nginx
volumeMounts:
- name: test
mountPath: /data
ports:
- containerPort: 80
volumes:
- name: test
flexVolume:
driver: "k8s/nfs"
fsType: "nfs"
options:
server: "172.16.0.25"
share: "dws_nas_scratch"
4.2. 插件脚本
nfs脚本实现了flexvolume的接口。
/usr/libexec/kubernetes/kubelet-plugins/volume/exec/k8s~nfs/nfs。
#!/bin/bash
# Copyright 2015 The Kubernetes Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Notes:
# - Please install "jq" package before using this driver.
usage() {
err "Invalid usage. Usage: "
err "\t$0 init"
err "\t$0 mount <mount dir> <json params>"
err "\t$0 unmount <mount dir>"
exit 1
}
err() {
echo -ne $* 1>&2
}
log() {
echo -ne $* >&1
}
ismounted() {
MOUNT=`findmnt -n ${MNTPATH} 2>/dev/null | cut -d' ' -f1`
if [ "${MOUNT}" == "${MNTPATH}" ]; then
echo "1"
else
echo "0"
fi
}
domount() {
MNTPATH=$1
NFS_SERVER=$(echo $2 | jq -r '.server')
SHARE=$(echo $2 | jq -r '.share')
if [ $(ismounted) -eq 1 ] ; then
log '{"status": "Success"}'
exit 0
fi
mkdir -p ${MNTPATH} &> /dev/null
mount -t nfs ${NFS_SERVER}:/${SHARE} ${MNTPATH} &> /dev/null
if [ $? -ne 0 ]; then
err "{ \"status\": \"Failure\", \"message\": \"Failed to mount ${NFS_SERVER}:${SHARE} at ${MNTPATH}\"}"
exit 1
fi
log '{"status": "Success"}'
exit 0
}
unmount() {
MNTPATH=$1
if [ $(ismounted) -eq 0 ] ; then
log '{"status": "Success"}'
exit 0
fi
umount ${MNTPATH} &> /dev/null
if [ $? -ne 0 ]; then
err "{ \"status\": \"Failed\", \"message\": \"Failed to unmount volume at ${MNTPATH}\"}"
exit 1
fi
log '{"status": "Success"}'
exit 0
}
op=$1
if ! command -v jq >/dev/null 2>&1; then
err "{ \"status\": \"Failure\", \"message\": \"'jq' binary not found. Please install jq package before using this driver\"}"
exit 1
fi
if [ "$op" = "init" ]; then
log '{"status": "Success", "capabilities": {"attach": false}}'
exit 0
fi
if [ $# -lt 2 ]; then
usage
fi
shift
case "$op" in
mount)
domount $*
;;
unmount)
unmount $*
;;
*)
log '{"status": "Not supported"}'
exit 0
esac
exit 1
参考:
- https://github.com/kubernetes/community/blob/master/contributors/devel/sig-storage/flexvolume.md
- https://github.com/kubernetes/examples/tree/master/staging/volumes/flexvolume
- https://github.com/kubernetes/examples/blob/master/staging/volumes/flexvolume/nginx-nfs.yaml
- https://github.com/kubernetes/examples/blob/master/staging/volumes/flexvolume/nfs
7 - 资源隔离
7.1 - 资源配额(ResourceQuota)
ResourceQuota对象用来定义某个命名空间下所有资源的使用限额,其实包括:
- 计算资源的配额
- 存储资源的配额
- 对象数量的配额
如果集群的总容量小于命名空间的配额总额,可能会产生资源竞争。这时会按照先到先得来处理。 资源竞争和配额的更新都不会影响已经创建好的资源。
1. 启动资源配额
许多Kubernetes发行版默认开启了资源配额的支持:在apiserver的--admission-control配置中添加ResourceQuota参数后,资源配额功能即被启用。当一个命名空间中含有ResourceQuota对象时,资源配额将强制执行。
2. 计算资源配额
可以在给定的命名空间中限制可以请求的计算资源(compute resources)的总量。
| 资源名称 | 描述 |
|---|---|
| cpu | 非终止态的所有pod, cpu请求总量不能超出此值。 |
| limits.cpu | 非终止态的所有pod, cpu限制总量不能超出此值。 |
| limits.memory | 非终止态的所有pod, 内存限制总量不能超出此值。 |
| memory | 非终止态的所有pod, 内存请求总量不能超出此值。 |
| requests.cpu | 非终止态的所有pod, cpu请求总量不能超出此值。 |
| requests.memory | 非终止态的所有pod, 内存请求总量不能超出此值。 |
3. 存储资源配额
可以在给定的命名空间中限制可以请求的存储资源(storage resources)的总量。
| 资源名称 | 描述 |
|---|---|
| requests.storage | 所有PVC, 存储请求总量不能超出此值。 |
| persistentvolumeclaims | 命名空间中可以存在的PVC(persistent volume claims)总数。 |
| <storage-class-name>.storageclass.storage.k8s.io/requests.storage | 和该存储类关联的所有PVC, 存储请求总和不能超出此值。 |
| <storage-class-name>.storageclass.storage.k8s.io/persistentvolumeclaims | 和该存储类关联的所有PVC, 命名空间中可以存在的PVC(persistent volume claims)总数。 |
4. 对象数量的配额
| 资源名称 | 描述 |
|---|---|
| configmaps | 命名空间中可以存在的配置映射的总数。 |
| persistentvolumeclaims | 命名空间中可以存在的PVC总数。 |
| pods | 命名空间中可以存在的非终止态的pod总数。如果一个pod的status.phase 是 Failed, Succeeded, 则该pod处于终止态。 |
| replicationcontrollers | 命名空间中可以存在的rc总数。 |
| resourcequotas | 命名空间中可以存在的资源配额(resource quotas)总数。 |
| services | 命名空间中可以存在的服务总数量。 |
| services.loadbalancers | 命名空间中可以存在的LoadBalancer类型服务的总数量。 |
| services.nodeports | 命名空间中可以存在的NodePort类型服务的总数量。 |
| secrets | 命名空间中可以存在的secrets的总数量。 |
例如:可以定义pod的限额来避免某用户消耗过多的Pod IPs。
5. 限额的作用域
| 作用域 | 描述 |
|---|---|
| Terminating | 匹配 spec.activeDeadlineSeconds >= 0 的pod |
| NotTerminating | 匹配 spec.activeDeadlineSeconds is nil 的pod |
| BestEffort | 匹配具有最佳服务质量的pod |
| NotBestEffort | 匹配具有非最佳服务质量的pod |
6. request和limit
当分配计算资源时,每个容器可以为cpu或者内存指定一个请求值和一个限度值。可以配置限额值来限制它们中的任何一个值。
如果指定了requests.cpu 或者 requests.memory的限额值,那么就要求传入的每一个容器显式的指定这些资源的请求。如果指定了limits.cpu或者limits.memory,那么就要求传入的每一个容器显式的指定这些资源的限度。
7. 查看和设置配额
# 创建namespace
$ kubectl create namespace myspace
# 创建resourcequota
$ cat <<EOF > compute-resources.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
name: compute-resources
spec:
hard:
pods: "4"
requests.cpu: "1"
requests.memory: 1Gi
limits.cpu: "2"
limits.memory: 2Gi
EOF
$ kubectl create -f ./compute-resources.yaml --namespace=myspace
# 查询resourcequota
$ kubectl get quota --namespace=myspace
NAME AGE
compute-resources 30s
# 查询resourcequota的详细信息
$ kubectl describe quota compute-resources --namespace=myspace
Name: compute-resources
Namespace: myspace
Resource Used Hard
-------- ---- ----
limits.cpu 0 2
limits.memory 0 2Gi
pods 0 4
requests.cpu 0 1
requests.memory 0 1Gi
8. 配额和集群容量
资源配额对象与集群容量无关,它们以绝对单位表示。即增加节点的资源并不会增加已经配置的namespace的资源。
7.2 - Pod限额(LimitRange)
ResourceQuota对象是限制某个namespace下所有Pod(容器)的资源限额
LimitRange对象是限制某个namespace单个Pod(容器)的资源限额
LimitRange对象用来定义某个命名空间下某种资源对象的使用限额,其中资源对象包括:Pod、Container、PersistentVolumeClaim。
1. 为namespace配置CPU和内存的默认值
如果在一个拥有默认内存或CPU限额的命名空间中创建一个容器,并且这个容器未指定它自己的内存或CPU的limit,它会被分配这个默认的内存或CPU的limit;只有当容器的limit和request都未设置时,才会被分配默认的内存或CPU的request(详见下文1.3节的说明)。
1.1. namespace的内存默认值
# 创建namespace
$ kubectl create namespace default-mem-example
# 创建LimitRange
$ cat memory-defaults.yaml
apiVersion: v1
kind: LimitRange
metadata:
name: mem-limit-range
spec:
limits:
- default:
memory: 512Mi
defaultRequest:
memory: 256Mi
type: Container
$ kubectl create -f https://k8s.io/docs/tasks/administer-cluster/memory-defaults.yaml --namespace=default-mem-example
# 创建Pod,未指定内存的limit和request
$ cat memory-defaults-pod.yaml
apiVersion: v1
kind: Pod
metadata:
name: default-mem-demo
spec:
containers:
- name: default-mem-demo-ctr
image: nginx
$ kubectl create -f https://k8s.io/docs/tasks/administer-cluster/memory-defaults-pod.yaml --namespace=default-mem-example
# 查看Pod
$ kubectl get pod default-mem-demo --output=yaml --namespace=default-mem-example
containers:
- image: nginx
imagePullPolicy: Always
name: default-mem-demo-ctr
resources:
limits:
memory: 512Mi
requests:
memory: 256Mi
1.2. namespace的CPU默认值
# 创建namespace
$ kubectl create namespace default-cpu-example
# 创建LimitRange
$ cat cpu-defaults.yaml
apiVersion: v1
kind: LimitRange
metadata:
name: cpu-limit-range
spec:
limits:
- default:
cpu: 1
defaultRequest:
cpu: 0.5
type: Container
$ kubectl create -f https://k8s.io/docs/tasks/administer-cluster/cpu-defaults.yaml --namespace=default-cpu-example
# 创建Pod,未指定CPU的limit和request
$ cat cpu-defaults-pod.yaml
apiVersion: v1
kind: Pod
metadata:
name: default-cpu-demo
spec:
containers:
- name: default-cpu-demo-ctr
image: nginx
$ kubectl create -f https://k8s.io/docs/tasks/administer-cluster/cpu-defaults-pod.yaml --namespace=default-cpu-example
# 查看Pod
$ kubectl get pod default-cpu-demo --output=yaml --namespace=default-cpu-example
containers:
- image: nginx
imagePullPolicy: Always
name: default-cpu-demo-ctr
resources:
limits:
cpu: "1"
requests:
cpu: 500m
1.3 说明
- 如果没有指定pod的`request`和`limit`,则创建的pod会使用LimitRange对象定义的默认值(request和limit)。
- 如果指定pod的`limit`但未指定`request`,则创建的pod的request值会取limit的值,而不会取LimitRange对象定义的request默认值。
- 如果指定pod的`request`但未指定`limit`,则创建的pod的limit值会取LimitRange对象定义的limit默认值。
默认Limit和request的动机
如果命名空间具有资源配额(ResourceQuota), 它为内存限额(CPU限额)设置默认值是有意义的。 以下是资源配额对命名空间施加的两个限制:
- 在命名空间运行的每一个容器必须有它自己的内存限额(CPU限额)。
- 在命名空间中所有的容器使用的内存总量(CPU总量)不能超出指定的限额。
如果一个容器没有指定它自己的内存限额(CPU限额),它将被赋予默认的限额值,然后它才可以在被配额限制的命名空间中运行。
2. 为namespace配置CPU和内存的最大最小值
2.1. 内存的最大最小值
创建LimitRange
# 创建namespace
$ kubectl create namespace constraints-mem-example
# 创建LimitRange
$ cat memory-constraints.yaml
apiVersion: v1
kind: LimitRange
metadata:
name: mem-min-max-demo-lr
spec:
limits:
- max:
memory: 1Gi
min:
memory: 500Mi
type: Container
$ kubectl create -f https://k8s.io/docs/tasks/administer-cluster/memory-constraints.yaml --namespace=constraints-mem-example
# 查看LimitRange
$ kubectl get limitrange mem-min-max-demo-lr --namespace=constraints-mem-example --output=yaml
...
limits:
- default:
memory: 1Gi
defaultRequest:
memory: 1Gi
max:
memory: 1Gi
min:
memory: 500Mi
type: Container
...
# LimitRange设置了最大最小值,但没有设置默认值,也会被自动设置默认值。
创建符合要求的Pod
# 创建符合要求的Pod
$ cat memory-constraints-pod.yaml
apiVersion: v1
kind: Pod
metadata:
name: constraints-mem-demo
spec:
containers:
- name: constraints-mem-demo-ctr
image: nginx
resources:
limits:
memory: "800Mi"
requests:
memory: "600Mi"
$ kubectl create -f https://k8s.io/docs/tasks/administer-cluster/memory-constraints-pod.yaml --namespace=constraints-mem-example
# 查看Pod
$ kubectl get pod constraints-mem-demo --output=yaml --namespace=constraints-mem-example
...
resources:
limits:
memory: 800Mi
requests:
memory: 600Mi
...
创建超过最大内存limit的pod
$ cat memory-constraints-pod-2.yaml
apiVersion: v1
kind: Pod
metadata:
name: constraints-mem-demo-2
spec:
containers:
- name: constraints-mem-demo-2-ctr
image: nginx
resources:
limits:
memory: "1.5Gi" # 超过最大值 1Gi
requests:
memory: "800Mi"
$ kubectl create -f https://k8s.io/docs/tasks/administer-cluster/memory-constraints-pod-2.yaml --namespace=constraints-mem-example
# Pod创建失败,因为容器指定的limit过大
Error from server (Forbidden): error when creating "docs/tasks/administer-cluster/memory-constraints-pod-2.yaml":
pods "constraints-mem-demo-2" is forbidden: maximum memory usage per Container is 1Gi, but limit is 1536Mi.
创建小于最小内存request的Pod
$ cat memory-constraints-pod-3.yaml
apiVersion: v1
kind: Pod
metadata:
name: constraints-mem-demo-3
spec:
containers:
- name: constraints-mem-demo-3-ctr
image: nginx
resources:
limits:
memory: "800Mi"
requests:
memory: "100Mi" # 小于最小值500Mi
$ kubectl create -f https://k8s.io/docs/tasks/administer-cluster/memory-constraints-pod-3.yaml --namespace=constraints-mem-example
# Pod创建失败,因为容器指定的内存request过小
Error from server (Forbidden): error when creating "docs/tasks/administer-cluster/memory-constraints-pod-3.yaml":
pods "constraints-mem-demo-3" is forbidden: minimum memory usage per Container is 500Mi, but request is 100Mi.
创建没有指定任何内存limit和request的pod
$ cat memory-constraints-pod-4.yaml
apiVersion: v1
kind: Pod
metadata:
name: constraints-mem-demo-4
spec:
containers:
- name: constraints-mem-demo-4-ctr
image: nginx
$ kubectl create -f https://k8s.io/docs/tasks/administer-cluster/memory-constraints-pod-4.yaml --namespace=constraints-mem-example
# 查看Pod
$ kubectl get pod constraints-mem-demo-4 --namespace=constraints-mem-example --output=yaml
...
resources:
limits:
memory: 1Gi
requests:
memory: 1Gi
...
容器没有指定自己的内存请求和限制,所以它将从 LimitRange 获取默认的内存请求和限制值。
2.2. CPU的最大最小值
创建LimitRange
# 创建namespace
$ kubectl create namespace constraints-cpu-example
# 创建LimitRange
$ cat cpu-constraints.yaml
apiVersion: v1
kind: LimitRange
metadata:
name: cpu-min-max-demo-lr
spec:
limits:
- max:
cpu: "800m"
min:
cpu: "200m"
type: Container
$ kubectl create -f https://k8s.io/docs/tasks/administer-cluster/cpu-constraints.yaml --namespace=constraints-cpu-example
# 查看LimitRange
$ kubectl get limitrange cpu-min-max-demo-lr --output=yaml --namespace=constraints-cpu-example
...
limits:
- default:
cpu: 800m
defaultRequest:
cpu: 800m
max:
cpu: 800m
min:
cpu: 200m
type: Container
...
创建符合要求的Pod
$ cat cpu-constraints-pod.yaml
apiVersion: v1
kind: Pod
metadata:
name: constraints-cpu-demo
spec:
containers:
- name: constraints-cpu-demo-ctr
image: nginx
resources:
limits:
cpu: "800m"
requests:
cpu: "500m"
$ kubectl create -f https://k8s.io/docs/tasks/administer-cluster/cpu-constraints-pod.yaml --namespace=constraints-cpu-example
# 查看Pod
$ kubectl get pod constraints-cpu-demo --output=yaml --namespace=constraints-cpu-example
...
resources:
limits:
cpu: 800m
requests:
cpu: 500m
...
创建超过最大CPU limit的Pod
$ cat cpu-constraints-pod-2.yaml
apiVersion: v1
kind: Pod
metadata:
name: constraints-cpu-demo-2
spec:
containers:
- name: constraints-cpu-demo-2-ctr
image: nginx
resources:
limits:
cpu: "1.5"
requests:
cpu: "500m"
$ kubectl create -f https://k8s.io/docs/tasks/administer-cluster/cpu-constraints-pod-2.yaml --namespace=constraints-cpu-example
# Pod创建失败,因为容器指定的CPU limit过大
Error from server (Forbidden): error when creating "docs/tasks/administer-cluster/cpu-constraints-pod-2.yaml":
pods "constraints-cpu-demo-2" is forbidden: maximum cpu usage per Container is 800m, but limit is 1500m.
创建小于最小CPU request的Pod
$ cat cpu-constraints-pod-3.yaml
apiVersion: v1
kind: Pod
metadata:
name: constraints-cpu-demo-4
spec:
containers:
- name: constraints-cpu-demo-4-ctr
image: nginx
resources:
limits:
cpu: "800m"
requests:
cpu: "100m"
$ kubectl create -f https://k8s.io/docs/tasks/administer-cluster/cpu-constraints-pod-3.yaml --namespace=constraints-cpu-example
# Pod创建失败,因为容器指定的CPU request过小
Error from server (Forbidden): error when creating "docs/tasks/administer-cluster/cpu-constraints-pod-3.yaml":
pods "constraints-cpu-demo-4" is forbidden: minimum cpu usage per Container is 200m, but request is 100m.
创建没有指定任何CPU limit和request的pod
$ cat cpu-constraints-pod-4.yaml
apiVersion: v1
kind: Pod
metadata:
name: constraints-cpu-demo-4
spec:
containers:
- name: constraints-cpu-demo-4-ctr
image: vish/stress
$ kubectl create -f https://k8s.io/docs/tasks/administer-cluster/cpu-constraints-pod-4.yaml --namespace=constraints-cpu-example
# 查看Pod
kubectl get pod constraints-cpu-demo-4 --namespace=constraints-cpu-example --output=yaml
...
resources:
limits:
cpu: 800m
requests:
cpu: 800m
...
容器没有指定自己的 CPU 请求和限制,所以它将从 LimitRange 获取默认的 CPU 请求和限制值。
2.3. 说明
LimitRange 在 namespace 中施加的最小和最大内存(CPU)限制只有在创建和更新 Pod 时才会被应用。改变 LimitRange 不会对之前创建的 Pod 造成影响。
每当创建或更新Pod时,Kubernetes 都会执行下列步骤:
- 如果容器没有指定自己的内存(CPU)请求(request)和限制(limit),系统将会为其分配默认值。
- 验证容器的内存(CPU)请求大于等于最小值。
- 验证容器的内存(CPU)限制小于等于最大值。
参考文章:
- https://kubernetes.io/docs/tasks/administer-cluster/manage-resources/memory-default-namespace/
- https://kubernetes.io/docs/tasks/administer-cluster/manage-resources/cpu-default-namespace/
- https://kubernetes.io/docs/tasks/administer-cluster/manage-resources/memory-constraint-namespace/
- https://kubernetes.io/docs/tasks/administer-cluster/manage-resources/cpu-constraint-namespace/
- https://kubernetes.io/docs/tasks/administer-cluster/manage-resources/quota-memory-cpu-namespace/
7.3 - Resource Quality of Service
1. 资源QoS简介
request值表示容器保证可被分配到的资源,limit表示容器可允许使用的最大资源。Pod级别的request和limit是其所有容器的request和limit之和。
2. Requests and Limits
Pod可以指定request和limit资源,其中 0 <= request <= Node Allocatable,且 request <= limit <= Infinity。调度是基于request而不是limit,即如果Pod被成功调度,那么可以保证Pod分配到指定的request的资源。Pod使用的资源能否超过指定的limit值取决于该资源是否可被压缩。
2.1. 可压缩的资源
- 目前只支持CPU
- pod可以保证获得它们请求的CPU数量,它们可能会也可能不会获得额外的CPU时间(取决于正在运行的其他作业)。因为目前CPU隔离是在容器级别而不是pod级别。
2.2. 不可压缩的资源
- 目前只支持内存
- pod将获得它们请求的内存数量,如果超过了它们的内存请求,它们可能会被杀死(如果其他一些pod需要内存),但如果pod消耗的内存小于请求的内存,那么它们将不会被杀死(除非在系统任务或守护进程需要更多内存的情况下)。
3. QoS 级别
在机器资源超卖的情况下(limit的总量大于机器的资源容量),即CPU或内存耗尽,将不得不杀死部分不重要的容器。因此对容器分成了3个QoS的级别:Guaranteed, Burstable, Best-Effort,三个级别的优先级依次递减。
当CPU资源无法满足时,Pod不会被杀死,而是会被暂时限制(throttle)其CPU使用。
内存是不可压缩的资源,当内存耗尽的情况下,会依次杀死优先级低的容器。Guaranteed的级别最高,不会被杀死,除非容器使用量超过limit限值或者资源耗尽,已经没有更低级别的容器可驱逐。
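Pod创建后,可以通过status.qosClass字段确认其被划分到的QoS级别,例如:
# 查看Pod的QoS级别
kubectl get pod <PodName> --namespace=<NAMESPACE> -o jsonpath='{.status.qosClass}'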
3.1. Guaranteed
所有的容器的limit值和request值被配置且两者相等(如果只配置limit没有request,则request取值于limit)。
例如:
# 示例1
containers:
- name: foo
  resources:
    limits:
      cpu: 10m
      memory: 1Gi
- name: bar
  resources:
    limits:
      cpu: 100m
      memory: 100Mi
# 示例2
containers:
- name: foo
  resources:
    limits:
      cpu: 10m
      memory: 1Gi
    requests:
      cpu: 10m
      memory: 1Gi
- name: bar
  resources:
    limits:
      cpu: 100m
      memory: 100Mi
    requests:
      cpu: 100m
      memory: 100Mi
3.2. Burstable
如果一个或多个容器的limit和request值被配置且两者不相等。
例如:
# 示例1
containers:
- name: foo
  resources:
    limits:
      cpu: 10m
      memory: 1Gi
    requests:
      cpu: 10m
      memory: 1Gi
- name: bar
# 示例2
containers:
- name: foo
  resources:
    limits:
      memory: 1Gi
- name: bar
  resources:
    limits:
      cpu: 100m
# 示例3
containers:
- name: foo
  resources:
    requests:
      cpu: 10m
      memory: 1Gi
- name: bar
3.3. Best-Effort
所有的容器的limit和request值都没有配置。
例如:
containers:
- name: foo
  resources: {}
- name: bar
  resources: {}
7.4 - Lxcfs资源视图隔离
1. 资源视图隔离
容器中的执行top、free等命令展示出来的CPU,内存等信息是从/proc目录中的相关文件里读取出来的。而容器并没有对/proc,/sys等文件系统做隔离,因此容器中读取出来的CPU和内存的信息是宿主机的信息,与容器实际分配和限制的资源量不同。
/proc/cpuinfo
/proc/diskstats
/proc/meminfo
/proc/stat
/proc/swaps
/proc/uptime
为了实现让容器内部的资源视图更像虚拟机,使得应用程序可以拿到真实的CPU和内存信息,就需要通过文件挂载的方式将cgroup的真实的容器资源信息挂载到容器内/proc下的文件,使得容器内执行top、free等命令时可以拿到真实的CPU和内存信息。
2. Lxcfs简介
lxcfs是一个FUSE文件系统,使得Linux容器的文件系统更像虚拟机。lxcfs是一个常驻进程运行在宿主机上,从而来自动维护宿主机cgroup中容器的真实资源信息与容器内/proc下文件的映射关系。
lxcfs的命令信息如下:
#/usr/local/bin/lxcfs -h
Usage:
lxcfs [-f|-d] -u -l -n [-p pidfile] mountpoint
-f running foreground by default; -d enable debug output
-l use loadavg
-u no swap
Default pidfile is /run/lxcfs.pid
lxcfs -h
lxcfs的源码:https://github.com/lxc/lxcfs
3. Lxcfs原理
lxcfs实现的基本原理是通过文件挂载的方式,把cgroup中容器相关的信息读取出来,存储到lxcfs相关的目录下,并将相关目录映射到容器内的/proc目录下,从而使得容器内执行top,free等命令时拿到的/proc下的数据是真实的cgroup分配给容器的CPU和内存数据。
原理图

映射目录
| 类别 | 容器内目录 | 宿主机lxcfs目录 |
|---|---|---|
| cpu | /proc/cpuinfo | /var/lib/lxcfs/{container_id}/proc/cpuinfo |
| 内存 | /proc/meminfo | /var/lib/lxcfs/{container_id}/proc/meminfo |
| /proc/diskstats | /var/lib/lxcfs/{container_id}/proc/diskstats | |
| /proc/stat | /var/lib/lxcfs/{container_id}/proc/stat | |
| /proc/swaps | /var/lib/lxcfs/{container_id}/proc/swaps | |
| /proc/uptime | /var/lib/lxcfs/{container_id}/proc/uptime | |
| /proc/loadavg | /var/lib/lxcfs/{container_id}/proc/loadavg | |
| /sys/devices/system/cpu/online | /var/lib/lxcfs/{container_id}/sys/devices/system/cpu/online |
4. 使用方式
4.1. 安装lxcfs
环境准备
yum install -y fuse fuse-lib fuse-devel
源码编译安装
git clone https://github.com/lxc/lxcfs
cd lxcfs
./bootstrap.sh
./configure
make
make install
或者通过rpm包安装
wget https://copr-be.cloud.fedoraproject.org/results/ganto/lxc3/epel-7-x86_64/01041891-lxcfs/lxcfs-3.1.2-0.2.el7.x86_64.rpm;
rpm -ivh lxcfs-3.1.2-0.2.el7.x86_64.rpm --force --nodeps
查看是否安装成功
lxcfs -h
4.2. 运行lxcfs
运行lxcfs主要执行两条命令。
sudo mkdir -p /var/lib/lxcfs
sudo lxcfs /var/lib/lxcfs
可以通过systemd运行。
lxcfs.service文件:
cat > /usr/lib/systemd/system/lxcfs.service <<EOF
[Unit]
Description=lxcfs
[Service]
ExecStart=/usr/bin/lxcfs -f /var/lib/lxcfs
Restart=on-failure
#ExecReload=/bin/kill -s SIGHUP $MAINPID
[Install]
WantedBy=multi-user.target
EOF
运行命令
systemctl daemon-reload && systemctl enable lxcfs && systemctl start lxcfs && systemctl status lxcfs
4.3. 挂载容器内/proc下的文件目录
docker run -it --rm -m 256m --cpus 2 \
-v /var/lib/lxcfs/proc/cpuinfo:/proc/cpuinfo:rw \
-v /var/lib/lxcfs/proc/diskstats:/proc/diskstats:rw \
-v /var/lib/lxcfs/proc/meminfo:/proc/meminfo:rw \
-v /var/lib/lxcfs/proc/stat:/proc/stat:rw \
-v /var/lib/lxcfs/proc/swaps:/proc/swaps:rw \
-v /var/lib/lxcfs/proc/uptime:/proc/uptime:rw \
nginx:latest /bin/sh
4.4. 验证容器内CPU和内存
# cpu
grep -c processor /proc/cpuinfo
cat /proc/cpuinfo
# memory
free -g
cat /proc/meminfo
5. 使用k8s集群部署
使用k8s集群部署与systemd部署方式同理,需要解决2个问题:
- 在每个node节点上部署lxcfs常驻进程,lxcfs需要通过镜像来运行,可以通过daemonset来部署。
- 实现将lxcfs维护的目录自动挂载到pod内的`/proc`目录。
具体可参考:https://github.com/denverdino/lxcfs-admission-webhook
5.1. lxcfs-image
FROM centos:7 as build
RUN yum -y update
RUN yum -y install fuse-devel pam-devel wget install gcc automake autoconf libtool make
ENV LXCFS_VERSION 3.1.2
RUN wget https://linuxcontainers.org/downloads/lxcfs/lxcfs-$LXCFS_VERSION.tar.gz && \
mkdir /lxcfs && tar xzvf lxcfs-$LXCFS_VERSION.tar.gz -C /lxcfs --strip-components=1 && \
cd /lxcfs && ./configure && make
FROM centos:7
STOPSIGNAL SIGINT
COPY --from=build /lxcfs/lxcfs /usr/local/bin/lxcfs
COPY --from=build /lxcfs/.libs/liblxcfs.so /usr/local/lib/lxcfs/liblxcfs.so
COPY --from=build /lxcfs/lxcfs /lxcfs/lxcfs
COPY --from=build /lxcfs/.libs/liblxcfs.so /lxcfs/liblxcfs.so
COPY --from=build /usr/lib64/libfuse.so.2.9.2 /usr/lib64/libfuse.so.2.9.2
COPY --from=build /usr/lib64/libulockmgr.so.1.0.1 /usr/lib64/libulockmgr.so.1.0.1
RUN ln -s /usr/lib64/libfuse.so.2.9.2 /usr/lib64/libfuse.so.2 && \
ln -s /usr/lib64/libulockmgr.so.1.0.1 /usr/lib64/libulockmgr.so.1
COPY start.sh /
CMD ["/start.sh"]
start.sh
#!/bin/bash
# Cleanup
nsenter -m/proc/1/ns/mnt fusermount -u /var/lib/lxcfs 2> /dev/null || true
nsenter -m/proc/1/ns/mnt [ -L /etc/mtab ] || \
sed -i "/^lxcfs \/var\/lib\/lxcfs fuse.lxcfs/d" /etc/mtab
# Prepare
mkdir -p /usr/local/lib/lxcfs /var/lib/lxcfs
# Update lxcfs
cp -f /lxcfs/lxcfs /usr/local/bin/lxcfs
cp -f /lxcfs/liblxcfs.so /usr/local/lib/lxcfs/liblxcfs.so
# Mount
exec nsenter -m/proc/1/ns/mnt /usr/local/bin/lxcfs /var/lib/lxcfs/
5.2. daemonset
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: lxcfs
labels:
app: lxcfs
spec:
selector:
matchLabels:
app: lxcfs
template:
metadata:
labels:
app: lxcfs
spec:
hostPID: true
tolerations:
- key: node-role.kubernetes.io/master
effect: NoSchedule
containers:
- name: lxcfs
image: registry.cn-hangzhou.aliyuncs.com/denverdino/lxcfs:3.1.2
imagePullPolicy: Always
securityContext:
privileged: true
volumeMounts:
- name: cgroup
mountPath: /sys/fs/cgroup
- name: lxcfs
mountPath: /var/lib/lxcfs
mountPropagation: Bidirectional
- name: usr-local
mountPath: /usr/local
volumes:
- name: cgroup
hostPath:
path: /sys/fs/cgroup
- name: usr-local
hostPath:
path: /usr/local
- name: lxcfs
hostPath:
path: /var/lib/lxcfs
type: DirectoryOrCreate
5.3. lxcfs-admission-webhook
lxcfs-admission-webhook实现了一个动态的准入webhook,更准确的讲是实现了一个修改性质的webhook,即监听pod的创建,然后对pod执行patch的操作,从而将lxcfs与容器内的目录映射关系植入到pod创建的yaml中从而实现自动挂载。
apiVersion: apps/v1
kind: Deployment
metadata:
name: lxcfs-admission-webhook-deployment
labels:
app: lxcfs-admission-webhook
spec:
replicas: 1
selector:
matchLabels:
app: lxcfs-admission-webhook
template:
metadata:
labels:
app: lxcfs-admission-webhook
spec:
containers:
- name: lxcfs-admission-webhook
image: registry.cn-hangzhou.aliyuncs.com/denverdino/lxcfs-admission-webhook:v1
imagePullPolicy: IfNotPresent
args:
- -tlsCertFile=/etc/webhook/certs/cert.pem
- -tlsKeyFile=/etc/webhook/certs/key.pem
- -alsologtostderr
- -v=4
- 2>&1
volumeMounts:
- name: webhook-certs
mountPath: /etc/webhook/certs
readOnly: true
volumes:
- name: webhook-certs
secret:
secretName: lxcfs-admission-webhook-certs
具体部署参考:install.sh
#!/bin/bash
./deployment/webhook-create-signed-cert.sh
kubectl get secret lxcfs-admission-webhook-certs
kubectl create -f deployment/deployment.yaml
kubectl create -f deployment/service.yaml
cat ./deployment/mutatingwebhook.yaml | ./deployment/webhook-patch-ca-bundle.sh > ./deployment/mutatingwebhook-ca-bundle.yaml
kubectl create -f deployment/mutatingwebhook-ca-bundle.yaml
执行命令
./deployment/install.sh
8 - 运维指南
8.1 - Kubernetes集群问题排查
1. 查看系统Event事件
kubectl describe pod <PodName> --namespace=<NAMESPACE>
该命令可以显示Pod创建时的配置定义、状态等信息和最近的Event事件,事件信息可用于排错。例如当Pod状态为Pending,可通过查看Event事件确认原因,一般原因有几种:
- 没有可用的Node可调度
- 开启了资源配额管理并且当前Pod的目标节点上恰好没有可用的资源
- 正在下载镜像(镜像拉取耗时太久)或镜像下载失败。
kubectl describe还可以查看其它k8s对象:NODE,RC,Service,Namespace,Secrets。
1.1. Pod
kubectl describe pod <PodName> --namespace=<NAMESPACE>
以下是容器的启动命令非阻塞式导致容器挂掉,被k8s频繁重启所产生的事件。
kubectl describe pod <PodName> --namespace=<NAMESPACE>
Events:
FirstSeen LastSeen Count From SubobjectPath Reason Message
───────── ──────── ───── ──── ───────────── ────── ───────
7m 7m 1 {scheduler } Scheduled Successfully assigned yangsc-1-0-0-index0 to 10.8.216.19
7m 7m 1 {kubelet 10.8.216.19} containers{infra} Pulled Container image "gcr.io/kube-system/pause:0.8.0" already present on machine
7m 7m 1 {kubelet 10.8.216.19} containers{infra} Created Created with docker id 84f133c324d0
7m 7m 1 {kubelet 10.8.216.19} containers{infra} Started Started with docker id 84f133c324d0
7m 7m 1 {kubelet 10.8.216.19} containers{yangsc0} Started Started with docker id 3f9f82abb145
7m 7m 1 {kubelet 10.8.216.19} containers{yangsc0} Created Created with docker id 3f9f82abb145
7m 7m 1 {kubelet 10.8.216.19} containers{yangsc0} Created Created with docker id fb112e4002f4
7m 7m 1 {kubelet 10.8.216.19} containers{yangsc0} Started Started with docker id fb112e4002f4
6m 6m 1 {kubelet 10.8.216.19} containers{yangsc0} Created Created with docker id 613b119d4474
6m 6m 1 {kubelet 10.8.216.19} containers{yangsc0} Started Started with docker id 613b119d4474
6m 6m 1 {kubelet 10.8.216.19} containers{yangsc0} Created Created with docker id 25cb68d1fd3d
6m 6m 1 {kubelet 10.8.216.19} containers{yangsc0} Started Started with docker id 25cb68d1fd3d
5m 5m 1 {kubelet 10.8.216.19} containers{yangsc0} Started Started with docker id 7d9ee8610b28
5m 5m 1 {kubelet 10.8.216.19} containers{yangsc0} Created Created with docker id 7d9ee8610b28
3m 3m 1 {kubelet 10.8.216.19} containers{yangsc0} Started Started with docker id 88b9e8d582dd
3m 3m 1 {kubelet 10.8.216.19} containers{yangsc0} Created Created with docker id 88b9e8d582dd
7m 1m 7 {kubelet 10.8.216.19} containers{yangsc0} Pulling Pulling image "gcr.io/test/tcp-hello:1.0.0"
1m 1m 1 {kubelet 10.8.216.19} containers{yangsc0} Started Started with docker id 089abff050e7
1m 1m 1 {kubelet 10.8.216.19} containers{yangsc0} Created Created with docker id 089abff050e7
7m 1m 7 {kubelet 10.8.216.19} containers{yangsc0} Pulled Successfully pulled image "gcr.io/test/tcp-hello:1.0.0"
6m 7s 34 {kubelet 10.8.216.19} containers{yangsc0} Backoff Back-off restarting failed docker container
1.2. NODE
kubectl describe node 10.8.216.20
[root@FC-43745A-10 ~]# kubectl describe node 10.8.216.20
Name: 10.8.216.20
Labels: kubernetes.io/hostname=10.8.216.20,namespace/bcs-cc=true,namespace/myview=true
CreationTimestamp: Mon, 17 Apr 2017 11:32:52 +0800
Phase:
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
──── ────── ───────────────── ────────────────── ────── ───────
Ready True Fri, 18 Aug 2017 09:38:33 +0800 Tue, 02 May 2017 17:40:58 +0800 KubeletReady kubelet is posting ready status
OutOfDisk False Fri, 18 Aug 2017 09:38:33 +0800 Mon, 17 Apr 2017 11:31:27 +0800 KubeletHasSufficientDisk kubelet has sufficient disk space available
Addresses: 10.8.216.20,10.8.216.20
Capacity:
cpu: 32
memory: 67323039744
pods: 40
System Info:
Machine ID: 723bafc7f6764022972b3eae1ce6b198
System UUID: 4C4C4544-0042-4210-8044-C3C04F595631
Boot ID: da01f2e3-987a-425a-9ca7-1caaec35d1e5
Kernel Version: 3.10.0-327.28.3.el7.x86_64
OS Image: CentOS Linux 7 (Core)
Container Runtime Version: docker://1.13.1
Kubelet Version: v1.1.1-xxx2-13.1+79c90c68bfb72f-dirty
Kube-Proxy Version: v1.1.1-xxx2-13.1+79c90c68bfb72f-dirty
ExternalID: 10.8.216.20
Non-terminated Pods: (6 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits
───────── ──── ──────────── ────────── ─────────────── ─────────────
bcs-cc bcs-cc-api-0-0-1364-index0 1 (3%) 1 (3%) 4294967296 (6%) 4294967296 (6%)
bcs-cc bcs-cc-api-0-0-1444-index0 1 (3%) 1 (3%) 4294967296 (6%) 4294967296 (6%)
fw fw-demo2-0-0-1519-index0 1 (3%) 1 (3%) 4294967296 (6%) 4294967296 (6%)
myview myview-api-0-0-1362-index0 1 (3%) 1 (3%) 4294967296 (6%) 4294967296 (6%)
myview myview-api-0-0-1442-index0 1 (3%) 1 (3%) 4294967296 (6%) 4294967296 (6%)
qa-ts-dna ts-dna-console3-0-0-1434-index0 1 (3%) 1 (3%) 4294967296 (6%) 4294967296 (6%)
Allocated resources:
(Total limits may be over 100%, i.e., overcommitted. More info: http://releases.k8s.io/HEAD/docs/user-guide/compute-resources.md)
CPU Requests CPU Limits Memory Requests Memory Limits
──────────── ────────── ─────────────── ─────────────
6 (18%) 6 (18%) 25769803776 (38%) 25769803776 (38%)
No events.
1.3. RC
kubectl describe rc mytest-1-0-0 --namespace=test
[root@FC-43745A-10 ~]# kubectl describe rc mytest-1-0-0 --namespace=test
Name: mytest-1-0-0
Namespace: test
Image(s): gcr.io/test/mywebcalculator:1.0.1
Selector: app=mytest,appVersion=1.0.0
Labels: app=mytest,appVersion=1.0.0,env=ts,zone=inner
Replicas: 1 current / 1 desired
Pods Status: 1 Running / 0 Waiting / 0 Succeeded / 0 Failed
No volumes.
Events:
FirstSeen LastSeen Count From SubobjectPath Reason Message
───────── ──────── ───── ──── ───────────── ────── ───────
20h 19h 9 {replication-controller } FailedCreate Error creating: Pod "mytest-1-0-0-index0" is forbidden: limited to 10 pods
20h 17h 7 {replication-controller } FailedCreate Error creating: pods "mytest-1-0-0-index0" already exists
20h 17h 4 {replication-controller } SuccessfulCreate Created pod: mytest-1-0-0-index0
1.4. NAMESPACE
kubectl describe namespace test
[root@FC-43745A-10 ~]# kubectl describe namespace test
Name: test
Labels: <none>
Status: Active
Resource Quotas
Resource Used Hard
--- --- ---
cpu 5 20
memory 1342177280 53687091200
persistentvolumeclaims 0 10
pods 4 10
replicationcontrollers 8 20
resourcequotas 1 1
secrets 3 10
services 8 20
No resource limits.
1.5. Service
kubectl describe service xxx-containers-1-1-0 --namespace=test
[root@FC-43745A-10 ~]# kubectl describe service xxx-containers-1-1-0 --namespace=test
Name: xxx-containers-1-1-0
Namespace: test
Labels: app=xxx-containers,appVersion=1.1.0,env=ts,zone=inner
Selector: app=xxx-containers,appVersion=1.1.0
Type: ClusterIP
IP: 10.254.46.42
Port: port-dna-tcp-35913 35913/TCP
Endpoints: 10.0.92.17:35913
Port: port-l7-tcp-8080 8080/TCP
Endpoints: 10.0.92.17:8080
Session Affinity: None
No events.
2. 查看容器日志
1、查看指定pod的日志
kubectl logs <pod_name>
kubectl logs -f <pod_name> #类似tail -f的方式查看
2、查看上一个pod的日志
kubectl logs -p <pod_name>
3、查看指定pod中指定容器的日志
kubectl logs <pod_name> -c <container_name>
4、kubectl logs --help
[root@node5 ~]# kubectl logs --help
Print the logs for a container in a pod. If the pod has only one container, the container name is optional.
Usage:
kubectl logs [-f] [-p] POD [-c CONTAINER] [flags]
Aliases:
logs, log
Examples:
# Return snapshot logs from pod nginx with only one container
$ kubectl logs nginx
# Return snapshot of previous terminated ruby container logs from pod web-1
$ kubectl logs -p -c ruby web-1
# Begin streaming the logs of the ruby container in pod web-1
$ kubectl logs -f -c ruby web-1
# Display only the most recent 20 lines of output in pod nginx
$ kubectl logs --tail=20 nginx
# Show all logs from pod nginx written in the last hour
$ kubectl logs --since=1h nginx
3. 查看k8s服务日志
3.1. journalctl
在Linux系统上通常使用systemd来管理kubernetes服务,并且journal系统会接管服务程序的输出日志,可以通过`systemctl status <服务名>`或`journalctl -u <服务名>`来查看服务状态和日志。
其中kubernetes组件包括:
| k8s组件 | 涉及日志内容 | 备注 |
|---|---|---|
| kube-apiserver | ||
| kube-controller-manager | Pod扩容相关或RC相关 | |
| kube-scheduler | Pod扩容相关或RC相关 | |
| kubelet | Pod生命周期相关:创建、停止等 | |
| etcd |
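假设各组件以systemd unit方式运行(unit名称以实际部署为准),可通过类似下面的命令查看服务状态和日志:
# 查看kubelet服务状态
systemctl status kubelet
# 实时跟踪kubelet的journal日志
journalctl -u kubelet -f
# 查看etcd最近一小时的日志
journalctl -u etcd --since "1 hour ago"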
3.2. 日志文件
也可以通过指定日志存放目录来保存和查看日志
- --logtostderr=false:不输出到stderr
- --log-dir=/var/log/kubernetes:日志的存放目录
- --alsologtostderr=false:设置为true表示日志输出到文件也输出到stderr
- --v=0:glog的日志级别
- --vmodule=gfs*=2,test*=4:glog基于模块的详细日志级别
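例如,可以在kubelet的启动参数中加入上述日志参数(仅为示意,其余参数省略):
kubelet --logtostderr=false --log-dir=/var/log/kubernetes --v=2 ...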
4. 常见问题
4.1. Pod状态一直为Pending
kubectl describe <pod_name> --namespace=<NAMESPACE>
查看该POD的事件。
- 正在下载镜像但拉取不下来(镜像拉取耗时太久)[一般都是该原因]
- 没有可用的Node可调度
- 开启了资源配额管理并且当前Pod的目标节点上恰好没有可用的资源
解决方法:
- 查看该POD所在宿主机与镜像仓库之间的网络是否有问题,可以手动拉取镜像
- 删除POD实例,让POD调度到别的宿主机上
4.2. Pod创建后不断重启
kubectl get pods中Pod状态一会running,一会不是,且RESTARTS次数不断增加。
一般原因为容器启动命令不是阻塞式命令,导致容器运行后马上退出。
非阻塞式命令:
- 本身CMD指定的命令就是非阻塞式命令
- 将服务启动方式设置为后台运行
解决方法:
1、将命令改为阻塞式命令(前台运行),例如:zkServer.sh start-foreground
2、java运行程序的启动脚本将 nohup xxx & 中的nohup和&去掉,例如:
nohup $JAVA_HOME/bin/java $JAVA_OPTS -cp $CLASSPATH com.cnc.open.processor.Main &
改为:
$JAVA_HOME/bin/java $JAVA_OPTS -cp $CLASSPATH com.cnc.open.processor.Main
文章参考《Kubernetes权威指南》
8.2 - kubectl工具
8.2.1 - kubectl安装与配置
1. kubectl的安装
curl -LO https://storage.googleapis.com/kubernetes-release/release/$(curl -s https://storage.googleapis.com/kubernetes-release/release/stable.txt)/bin/linux/amd64/kubectl && chmod +x kubectl && sudo mv kubectl /usr/local/bin/
安装指定版本的kubectl,例如:v1.9.0
curl -LO https://storage.googleapis.com/kubernetes-release/release/v1.9.0/bin/linux/amd64/kubectl && chmod +x kubectl && sudo mv kubectl /usr/local/bin/
2. 配置k8s集群环境
2.1. 命令行方式
2.1.1 非安全方式
kubectl config set-cluster k8s --server=http://<url>
kubectl config set-context <NAMESPACE> --cluster=k8s --namespace=<NAMESPACE>
kubectl config use-context <NAMESPACE>
2.1.2 安全方式
kubectl config set-cluster k8s --server=https://<url> --insecure-skip-tls-verify=true
kubectl config set-credentials k8s-user --username=<username> --password=<password>
kubectl config set-context <NAMESPACE> --cluster=k8s --user=k8s-user --namespace=<NAMESPACE>
kubectl config use-context <NAMESPACE>
2.1.3 查询当前配置环境
[root@test ]# kubectl cluster-info
Kubernetes master is running at http://192.168.10.3:8081
2.2. 添加配置文件的方式
当没有指定 --kubeconfig参数和$KUBECONFIG的环境变量的时候,会默认读取${HOME}/.kube/config。
因此创建`${HOME}/.kube/config`文件,并在`${HOME}/.kube/ssl`目录下创建ca.pem、cert.pem、key.pem文件。
内容如下:
apiVersion: v1
kind: Config
clusters:
- name: local
cluster:
certificate-authority: ./ssl/ca.pem
server: https://192.168.10.3:6443
users:
- name: kubelet
user:
client-certificate: ./ssl/cert.pem
client-key: ./ssl/key.pem
contexts:
- context:
cluster: local
user: kubelet
name: kubelet-cluster.local
current-context: kubelet-cluster.local
3. kubectl config
kubectl config命令说明
$ kubectl config --help
Modify kubeconfig files using subcommands like "kubectl config set current-context my-context"
The loading order follows these rules:
1. If the --kubeconfig flag is set, then only that file is loaded. The flag may only be set once and no merging takes
place.
2. If $KUBECONFIG environment variable is set, then it is used a list of paths (normal path delimitting rules for your
system). These paths are merged. When a value is modified, it is modified in the file that defines the stanza. When a
value is created, it is created in the first file that exists. If no files in the chain exist, then it creates the last
file in the list.
3. Otherwise, ${HOME}/.kube/config is used and no merging takes place.
Available Commands:
current-context Displays the current-context
delete-cluster Delete the specified cluster from the kubeconfig
delete-context Delete the specified context from the kubeconfig
get-clusters Display clusters defined in the kubeconfig
get-contexts Describe one or many contexts
rename-context Renames a context from the kubeconfig file.
set Sets an individual value in a kubeconfig file
set-cluster Sets a cluster entry in kubeconfig
set-context Sets a context entry in kubeconfig
set-credentials Sets a user entry in kubeconfig
unset Unsets an individual value in a kubeconfig file
use-context Sets the current-context in a kubeconfig file
view Display merged kubeconfig settings or a specified kubeconfig file
Usage:
kubectl config SUBCOMMAND [options]
Use "kubectl <command> --help" for more information about a given command.
Use "kubectl options" for a list of global command-line options (applies to all commands).
4. shell自动补齐
source <(kubectl completion bash)
echo "source <(kubectl completion bash)" >> ~/.bashrc
如果出现以下报错
# kubectl自动补齐失败
kubectl _get_comp_words_by_ref : command not found
解决方法:
yum install bash-completion -y
source /etc/profile.d/bash_completion.sh
8.2.2 - kubectl命令说明
1. kubectl命令介绍
kubectl的命令语法
kubectl [command] [TYPE] [NAME] [flags]
其中command,TYPE,NAME,和flags分别是:
- `command`: 指定要对一个或多个资源执行的操作,例如create、get、describe、delete。
- `TYPE`: 指定资源类型。资源类型区分大小写,可以指定单数、复数或缩写形式。例如,以下命令产生相同的输出:`kubectl get pod pod1`、`kubectl get pods pod1`、`kubectl get po pod1`。
- `NAME`: 指定资源的名称。名称区分大小写。如果省略名称,则会显示所有资源的详细信息,比如`kubectl get pods`。按类型和名称指定多种资源:
  - 同一类型的资源可以分组指定:`TYPE1 name1 name2 name<#>`,例如:`kubectl get pod example-pod1 example-pod2`
  - 也可以分别指定多种资源类型:`TYPE1/name1 TYPE1/name2 TYPE2/name3 TYPE<#>/name<#>`,例如:`kubectl get pod/example-pod1 replicationcontroller/example-rc1`
- `flags`: 指定可选标志。例如,可以使用`-s`或`--server`来指定Kubernetes API服务器的地址和端口。
更多命令介绍:
[root@node5 ~]# kubectl
kubectl controls the Kubernetes cluster manager.
Find more information at https://github.com/kubernetes/kubernetes.
Basic Commands (Beginner):
create Create a resource from a file or from stdin.
expose Take a replication controller, service, deployment or pod and expose it as a new Kubernetes Service
run Run a particular image on the cluster
set Set specific features on objects
run-container Run a particular image on the cluster. This command is deprecated, use "run" instead
Basic Commands (Intermediate):
get Display one or many resources
explain Documentation of resources
edit Edit a resource on the server
delete Delete resources by filenames, stdin, resources and names, or by resources and label selector
Deploy Commands:
rollout Manage the rollout of a resource
rolling-update Perform a rolling update of the given ReplicationController
scale Set a new size for a Deployment, ReplicaSet, Replication Controller, or Job
autoscale Auto-scale a Deployment, ReplicaSet, or ReplicationController
Cluster Management Commands:
certificate Modify certificate resources.
cluster-info Display cluster info
top Display Resource (CPU/Memory/Storage) usage.
cordon Mark node as unschedulable
uncordon Mark node as schedulable
drain Drain node in preparation for maintenance
taint Update the taints on one or more nodes
Troubleshooting and Debugging Commands:
describe Show details of a specific resource or group of resources
logs Print the logs for a container in a pod
attach Attach to a running container
exec Execute a command in a container
port-forward Forward one or more local ports to a pod
proxy Run a proxy to the Kubernetes API server
cp Copy files and directories to and from containers.
auth Inspect authorization
Advanced Commands:
apply Apply a configuration to a resource by filename or stdin
patch Update field(s) of a resource using strategic merge patch
replace Replace a resource by filename or stdin
convert Convert config files between different API versions
Settings Commands:
label Update the labels on a resource
annotate Update the annotations on a resource
completion Output shell completion code for the specified shell (bash or zsh)
Other Commands:
api-versions Print the supported API versions on the server, in the form of "group/version"
config Modify kubeconfig files
help Help about any command
plugin Runs a command-line plugin
version Print the client and server version information
Use "kubectl <command> --help" for more information about a given command.
Use "kubectl options" for a list of global command-line options (applies to all commands).
2. 操作的常用资源对象
- Node
- Podes
- Replication Controllers
- Services
- Namespace
- Deployment
- StatefulSet
具体对象类型及缩写:
* all
* certificatesigningrequests (aka 'csr')
* clusterrolebindings
* clusterroles
* componentstatuses (aka 'cs')
* configmaps (aka 'cm')
* controllerrevisions
* cronjobs
* customresourcedefinition (aka 'crd')
* daemonsets (aka 'ds')
* deployments (aka 'deploy')
* endpoints (aka 'ep')
* events (aka 'ev')
* horizontalpodautoscalers (aka 'hpa')
* ingresses (aka 'ing')
* jobs
* limitranges (aka 'limits')
* namespaces (aka 'ns')
* networkpolicies (aka 'netpol')
* nodes (aka 'no')
* persistentvolumeclaims (aka 'pvc')
* persistentvolumes (aka 'pv')
* poddisruptionbudgets (aka 'pdb')
* podpreset
* pods (aka 'po')
* podsecuritypolicies (aka 'psp')
* podtemplates
* replicasets (aka 'rs')
* replicationcontrollers (aka 'rc')
* resourcequotas (aka 'quota')
* rolebindings
* roles
* secrets
* serviceaccounts (aka 'sa')
* services (aka 'svc')
* statefulsets (aka 'sts')
* storageclasses (aka 'sc')
3. kubectl命令分类[command]
3.1 增
1)create:[Create a resource by filename or stdin]
2)run:[ Run a particular image on the cluster]
3)apply:[Apply a configuration to a resource by filename or stdin]
4)proxy:[Run a proxy to the Kubernetes API server ]
3.2 删
1)delete:[Delete resources ]
3.3 改
1)scale:[Set a new size for a Replication Controller]
2)exec:[Execute a command in a container]
3)attach:[Attach to a running container]
4)patch:[Update field(s) of a resource by stdin]
5)edit:[Edit a resource on the server]
6) label:[Update the labels on a resource]
7)annotate:[Update the annotations on a resource]
8)replace:[Replace a resource by filename or stdin]
9)config:[config modifies kubeconfig files]
3.4 查
1)get:[Display one or many resources]
2)describe:[Show details of a specific resource or group of resources]
3)log:[Print the logs for a container in a pod]
4)cluster-info:[Display cluster info]
5) version:[Print the client and server version information]
6)api-versions:[Print the supported API versions]
4. Pod相关命令
4.1 查询Pod
kubectl get pod -o wide --namespace=<NAMESPACE>
4.2 进入Pod
kubectl exec -it <PodName> /bin/bash --namespace=<NAMESPACE>
# 进入Pod中指定容器
kubectl exec -it <PodName> -c <ContainerName> /bin/bash --namespace=<NAMESPACE>
4.3 删除Pod
kubectl delete pod <PodName> --namespace=<NAMESPACE>
# 强制删除Pod,当Pod一直处于Terminating状态
kubectl delete pod <PodName> --namespace=<NAMESPACE> --force --grace-period=0
# 删除某个namespace下某个类型的所有对象
kubectl delete deploy --all --namespace=test
4.4 日志查看
# 查看运行容器日志
kubectl logs <PodName> --namespace=<NAMESPACE>
# 查看上一个挂掉的容器日志
kubectl logs <PodName> -p --namespace=<NAMESPACE>
5. 常用命令
5.1. Node隔离与恢复
说明:Node设置隔离之后,原先运行在该Node上的Pod不受影响,后续的Pod不会调度到被隔离的Node上。
1. Node隔离
# cordon命令
kubectl cordon <NodeName>
# 或者
kubectl patch node <NodeName> -p '{"spec":{"unschedulable":true}}'
2. Node恢复
# uncordon
kubectl uncordon <NodeName>
# 或者
kubectl patch node <NodeName> -p '{"spec":{"unschedulable":false}}'
5.2. kubectl label
1. 固定Pod到指定机器
kubectl label node <NodeName> namespace/<NAMESPACE>=true
2. 取消Pod固定机器
kubectl label node <NodeName> namespace/<NAMESPACE>-
5.3. 升级镜像
# 升级镜像
kubectl set image deployment/nginx nginx=nginx:1.15.12 -n nginx
# 查看滚动升级情况
kubectl rollout status deployment/nginx -n nginx
5.4. 调整资源值
# 调整指定容器的资源值
kubectl set resources sts nginx-0 -c=agent --limits=memory=512Mi -n nginx
5.5. 调整readiness probe
# 批量查看readiness probe timeoutSeconds
kubectl get statefulset -o=jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.template.spec.containers[0].readinessProbe.timeoutSeconds}{"\n"}{end}'
# 调整readiness probe timeoutSeconds参数
kubectl patch statefulset nginx-sts --type='json' -p='[{"op": "replace", "path": "/spec/template/spec/containers/0/readinessProbe/timeoutSeconds", "value":5}]' -n nginx
5.6. 调整tolerations属性
kubectl patch statefulset nginx-sts --patch '{"spec": {"template": {"spec": {"tolerations": [{"effect": "NoSchedule","key": "dedicated","operator": "Equal","value": "nginx"}]}}}}' -n nginx
5.7. 查看所有节点的IP
kubectl get nodes -o=jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.addresses[0].address}{"\n"}{end}'
5.8. 查看当前k8s组件leader节点
当k8s集群高可用部署的时候,kube-controller-manager和kube-scheduler只能一个服务处于实际逻辑运行状态,通过参数--leader-elect=true来开启选举操作。以下提供查询leader节点的命令。
$ kubectl get endpoints kube-controller-manager --namespace=kube-system -o yaml
apiVersion: v1
kind: Endpoints
metadata:
annotations:
control-plane.alpha.kubernetes.io/leader: '{"holderIdentity":"xxx.xxx.xxx.xxx_6537b938-7f5a-11e9-8487-00220d338975","leaseDurationSeconds":15,"acquireTime":"2019-05-26T02:03:18Z","renewTime":"2019-05-26T02:06:08Z","leaderTransitions":1}'
creationTimestamp: "2019-05-26T01:52:39Z"
name: kube-controller-manager
namespace: kube-system
resourceVersion: "1965"
selfLink: /api/v1/namespaces/kube-system/endpoints/kube-controller-manager
uid: f1755fc5-7f58-11e9-b4c4-00220d338975
以上表示"holderIdentity":"xxx.xxx.xxx.xxx为kube-controller-manager的leader节点。
同理,可以通过以下命令查看kube-scheduler的leader节点。
kubectl get endpoints kube-scheduler --namespace=kube-system -o yaml
5.9. 修改副本数
kubectl scale deployment.v1.apps/nginx-deployment --replicas=10
5.10. 批量删除pod
kubectl get po -n default |grep Evicted |awk '{print $1}' |xargs -I {} kubectl delete po {} -n default
5.11. 各种查看命令
# 不使用外部工具来输出解码后的 Secret
kubectl get secret my-secret -o go-template='{{range $k,$v := .data}}{{"### "}}{{$k}}{{"\n"}}{{$v|base64decode}}{{"\n\n"}}{{end}}'
# 列出事件(Events),按时间戳排序
kubectl get events --sort-by=.metadata.creationTimestamp
6. kubectl日志级别
Kubectl 日志输出详细程度是通过 -v 或者 --v 来控制的,参数后跟一个数字表示日志的级别。Kubernetes 通用的日志约定和相关的日志级别在官方文档中有相应的描述。
| 详细程度 | 描述 |
|---|---|
| --v=0 | 用于那些应该始终对运维人员可见的信息,因为这些信息一般很有用。 |
| --v=1 | 如果您不想要看到冗余信息,此值是一个合理的默认日志级别。 |
| --v=2 | 输出有关服务的稳定状态的信息以及重要的日志消息,这些信息可能与系统中的重大变化有关。这是建议大多数系统设置的默认日志级别。 |
| --v=3 | 包含有关系统状态变化的扩展信息。 |
| --v=4 | 包含调试级别的冗余信息。 |
| --v=5 | 跟踪级别的详细程度。 |
| --v=6 | 显示所请求的资源。 |
| --v=7 | 显示 HTTP 请求头。 |
| --v=8 | 显示 HTTP 请求内容。 |
| --v=9 | 显示 HTTP 请求内容而且不截断内容。 |
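例如,使用-v=8运行kubectl即可看到其实际发起的HTTP请求与响应内容,便于排查API调用问题:
kubectl get pods -v=8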
8.2.3 - kubectl命令别名
1. kubectl-aliases
kubectl-aliases开源工具是由脚本通过拼接各种kubectl相关元素组成的alias命令别名列表,其中命令别名拼接元素如下:
| base | [system?] | [operation] | [resource] | [flags] |
|---|---|---|---|---|
| kubectl | -n=kube-system | get、describe、rm(delete)、logs、exec、apply | pods、deployment、secret、ingress、node、svc、ns、cm | oyaml、ojson、owide、all、watch、f、l |
- k=kubectl
  - sys=--namespace kube-system
- commands:
  - g=get
  - d=describe
  - rm=delete
  - a: apply -f
  - ex: exec -i -t
  - lo: logs -f
- resources:
  - po=pod
  - dep=deployment
  - ing=ingress
  - svc=service
  - cm=configmap
  - sec=secret
  - ns=namespace
  - no=node
- flags:
  - 输出格式: oyaml、ojson、owide
  - all: --all 或 --all-namespaces(取决于具体命令)
  - sl: --show-labels
  - w=-w/--watch
- value flags(需放在命令末尾):
  - f=-f/--filename
  - l=-l/--selector
2. 示例
# 示例1
kd → kubectl describe
# 示例2
kgdepallw → kubectl get deployment --all-namespaces --watch
alias get示例:
alias k='kubectl'
alias kg='kubectl get'
alias kgpo='kubectl get pods'
alias kgpoojson='kubectl get pods -o=json'
alias kgpon='kubectl get pods --namespace'
alias ksysgpooyamll='kubectl --namespace=kube-system get pods -o=yaml -l'
3. 安装
# 将 .kubectl_aliases下载到 home 目录
cd ~ && wget https://raw.githubusercontent.com/ahmetb/kubectl-aliases/master/.kubectl_aliases
# 将以下内容添加到 .bashrc中,并执行 source .bashrc
[ -f ~/.kubectl_aliases ] && source ~/.kubectl_aliases
function kubectl() { command kubectl $@; }
# 如果需要提示别名的完整命令,则将以下内容添加到 .bashrc中,并执行 source .bashrc
[ -f ~/.kubectl_aliases ] && source ~/.kubectl_aliases
function kubectl() { echo "+ kubectl $@"; command kubectl $@; }
8.3 - 节点调度
8.3.1 - 指定节点调度与隔离
1. NodeSelector
1.1. 概念
如果需要限制Pod到指定的Node上运行,则可以给Node打标签并给Pod配置NodeSelector。
1.2. 使用方式
1.2.1. 给Node打标签
# get node的name
kubectl get nodes
# 设置Label
kubectl label nodes <node-name> <label-key>=<label-value>
# 例如
kubectl label nodes node-1 disktype=ssd
# 查看Node的Label
kubectl get nodes --show-labels
# 删除Node的label
kubectl label node <node-name> <label-key>-
1.2.2. 给Pod设置NodeSelector
apiVersion: v1
kind: Pod
metadata:
name: nginx
labels:
env: test
spec:
containers:
- name: nginx
image: nginx
imagePullPolicy: IfNotPresent
nodeSelector:
disktype: ssd # 对应Node的Label
1.3. 亲和性(Affinity)和反亲和性(Anti-affinity)
待补充
2. Taint 和 Toleration
2.1. 概念
nodeSelector可以通过打标签的形式让Pod被调度到指定的Node上,Taint 则相反,它使节点能够排斥一类特定的Pod,除非Pod被指定了toleration的标签。(taint即污点,Node被打上污点;只有容忍[toleration]这些污点的Pod才可能被调度到该Node)。
2.2. 使用方式
2.2.1. kubectl taint
# 给节点增加一个taint,它的key是<key>,value是<value>,effect是NoSchedule。
kubectl taint nodes <node_name> <key>=<value>:NoSchedule
只有拥有和这个taint相匹配的toleration的pod才能够被分配到 node_name 这个节点。
例如,在 PodSpec 中定义 pod 的 toleration:
tolerations:
- key: "key"
operator: "Equal"
value: "value"
effect: "NoSchedule"
tolerations:
- key: "key"
operator: "Exists"
effect: "NoSchedule"
2.2.2. 匹配规则:
一个 toleration 和一个 taint 相“匹配”是指它们有一样的 key 和 effect ,并且:
- 如果`operator`是`Exists`(此时 toleration 不能指定`value`);
- 如果`operator`是`Equal`,则它们的`value`应该相等。
特殊情况:
- 如果一个 toleration 的`key`为空且`operator`为`Exists`,表示这个 toleration 与任意的 key、value 和 effect 都匹配,即这个 toleration 能容忍任意 taint。

  tolerations:
  - operator: "Exists"

- 如果一个 toleration 的`effect`为空,则可以与 key 值相同、任意 effect 的 taint 相匹配。

  tolerations:
  - key: "key"
    operator: "Exists"
一个节点可以设置多个taint,一个pod也可以设置多个toleration。Kubernetes 处理多个 taint 和 toleration 的过程就像一个过滤器:从一个节点的所有 taint 开始遍历,过滤掉那些 pod 中存在与之相匹配的 toleration 的 taint。余下未被过滤的 taint 的 effect 值决定了 pod 是否会被分配到该节点,特别是以下情况:
- 如果未被过滤的 taint 中存在一个以上 effect 值为`NoSchedule`的 taint,则 Kubernetes 不会将 pod 分配到该节点。
- 如果未被过滤的 taint 中不存在 effect 值为`NoSchedule`的 taint,但是存在 effect 值为`PreferNoSchedule`的 taint,则 Kubernetes 会尝试将 pod 分配到该节点。
- 如果未被过滤的 taint 中存在一个以上 effect 值为`NoExecute`的 taint,则 Kubernetes 不会将 pod 分配到该节点(如果 pod 还未在节点上运行),或者将 pod 从该节点驱逐(如果 pod 已经在节点上运行)。
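下面用一个例子示意该过滤过程(节点名与key/value仅作示意):假设先给节点node1设置了三个taint,而Pod只定义了其中两个对应的toleration。
# 节点上的taint
kubectl taint nodes node1 key1=value1:NoSchedule
kubectl taint nodes node1 key1=value1:NoExecute
kubectl taint nodes node1 key2=value2:NoSchedule
# Pod中的toleration
tolerations:
- key: "key1"
  operator: "Equal"
  value: "value1"
  effect: "NoSchedule"
- key: "key1"
  operator: "Equal"
  value: "value1"
  effect: "NoExecute"
该Pod无法容忍key2对应的NoSchedule taint,因此不会被调度到node1;但由于所有NoExecute的taint都能被容忍,如果Pod已经在node1上运行则不会被驱逐。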
2.2.3. effect的类型
- `NoSchedule`:只有拥有和这个 taint 相匹配的 toleration 的 pod 才能够被分配到这个节点。
- `PreferNoSchedule`:系统会尽量避免将 pod 调度到存在其不能容忍 taint 的节点上,但这不是强制的。
- `NoExecute`:任何不能忍受这个 taint 的 pod 都会马上被驱逐,任何可以忍受这个 taint 的 pod 都不会被驱逐。Pod可指定属性`tolerationSeconds`的值,表示 pod 还能继续在节点上运行的时间。

  tolerations:
  - key: "key1"
    operator: "Equal"
    value: "value1"
    effect: "NoExecute"
    tolerationSeconds: 3600
2.3. 使用场景
2.3.1. 专用节点
kubectl taint nodes <nodename> dedicated=<groupName>:NoSchedule
先给Node添加taint,然后给Pod添加相对应的 toleration,则该Pod可调度到taint的Node,也可调度到其他节点。
如果想让Pod只调度某些节点且某些节点只接受对应的Pod,则需要在Node上添加Label(例如:dedicated=groupName),同时给Pod的nodeSelector添加对应的Label。
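例如,下面的Pod片段同时声明了toleration和nodeSelector,使其只会被调度到打了dedicated=groupName标签且设置了对应taint的专用节点上(label与taint的key/value仅作示意):
spec:
  nodeSelector:
    dedicated: groupName
  tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "groupName"
    effect: "NoSchedule"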
2.3.2. 特殊硬件节点
如果某些节点配置了特殊硬件(例如CPU),希望不使用这些特殊硬件的Pod不被调度该Node,以便保留必要资源。即可给Node设置taint和label,同时给Pod设置toleration和label来使得这些Node专门被指定Pod使用。
# kubectl taint
kubectl taint nodes nodename special=true:NoSchedule
# 或者
kubectl taint nodes nodename special=true:PreferNoSchedule
2.3.3. 基于taint驱逐
effect 值 NoExecute ,它会影响已经在节点上运行的 pod,即根据策略对Pod进行驱逐。
- 如果 pod 不能忍受 effect 值为`NoExecute`的 taint,那么 pod 将马上被驱逐。
- 如果 pod 能够忍受 effect 值为`NoExecute`的 taint,但是在 toleration 定义中没有指定`tolerationSeconds`,则 pod 会一直在这个节点上运行。
- 如果 pod 能够忍受 effect 值为`NoExecute`的 taint,而且指定了`tolerationSeconds`,则 pod 还能在这个节点上继续运行这个指定的时间长度。
8.3.2 - 安全迁移节点
1. 迁移Pod
1.1. 设置节点是否可调度
确定需要迁移和被迁移的节点,将不允许被迁移的节点设置为不可调度。
# 查看节点
kubectl get nodes
# 设置节点为不可调度
kubectl cordon <NodeName>
# 设置节点为可调度
kubectl uncordon <NodeName>
1.2. 执行kubectl drain命令
kubectl drain <NodeName> --force --ignore-daemonsets
示例:
$ kubectl drain bjzw-prek8sredis-99-40 --force --ignore-daemonsets
node "bjzw-prek8sredis-99-40" already cordoned
WARNING: Deleting pods not managed by ReplicationController, ReplicaSet, Job, DaemonSet or StatefulSet: kube-proxy-bjzw-prek8sredis-99-40; Ignoring DaemonSet-managed pods: calicoopsmonitor-mfpqs, arachnia-agent-j56n8
pod "pre-test-pro2-r-0-redis-2-8-19-1" evicted
pod "pre-test-hwh1-r-8-redis-2-8-19-2" evicted
pod "pre-eos-hdfs-vector-eos-hdfs-redis-2-8-19-0" evicted
1.3. 特别说明
对于statefulset创建的Pod,kubectl drain的说明如下:
kubectl drain操作会将相应节点上的旧Pod删除,并在可调度节点上面起一个对应的Pod。当旧Pod没有被正常删除的情况下,新Pod不会起来。例如:旧Pod一直处于Terminating状态。
对应的解决方式是通过重启相应节点的kubelet,或者强制删除该Pod。
示例:
# 重启发生`Terminating`节点的kubelet
systemctl restart kubelet
# 强制删除`Terminating`状态的Pod
kubectl delete pod <PodName> --namespace=<Namespace> --force --grace-period=0
2. kubectl drain 流程图
3. TroubleShooting
1、存在不是通过ReplicationController, ReplicaSet, Job, DaemonSet 或者 StatefulSet创建的Pod(例如直接通过Pod定义文件创建、未被控制器管理的Pod),所以需要设置强制执行的参数--force。
$ kubectl drain bjzw-prek8sredis-99-40
node "bjzw-prek8sredis-99-40" already cordoned
error: unable to drain node "bjzw-prek8sredis-99-40", aborting command...
There are pending nodes to be drained:
bjzw-prek8sredis-99-40
error: DaemonSet-managed pods (use --ignore-daemonsets to ignore): calicoopsmonitor-mfpqs, arachnia-agent-j56n8; pods not managed by ReplicationController, ReplicaSet, Job, DaemonSet or StatefulSet (use --force to override): kube-proxy-bjzw-prek8sredis-99-40
2、存在DaemonSet方式管理的Pod,需要设置--ignore-daemonsets参数忽略报错。
$ kubectl drain bjzw-prek8sredis-99-40 --force
node "bjzw-prek8sredis-99-40" already cordoned
error: unable to drain node "bjzw-prek8sredis-99-40", aborting command...
There are pending nodes to be drained:
bjzw-prek8sredis-99-40
error: DaemonSet-managed pods (use --ignore-daemonsets to ignore): calicoopsmonitor-mfpqs, arachnia-agent-j56n8
4. kubectl drain
$ kubectl drain --help
Drain node in preparation for maintenance.
The given node will be marked unschedulable to prevent new pods from arriving. 'drain' evicts the pods if the APIServer
supports eviction (http://kubernetes.io/docs/admin/disruptions/). Otherwise, it will use normal DELETE to delete the
pods. The 'drain' evicts or deletes all pods except mirror pods (which cannot be deleted through the API server). If
there are DaemonSet-managed pods, drain will not proceed without --ignore-daemonsets, and regardless it will not delete
any DaemonSet-managed pods, because those pods would be immediately replaced by the DaemonSet controller, which ignores
unschedulable markings. If there are any pods that are neither mirror pods nor managed by ReplicationController,
ReplicaSet, DaemonSet, StatefulSet or Job, then drain will not delete any pods unless you use --force. --force will
also allow deletion to proceed if the managing resource of one or more pods is missing.
'drain' waits for graceful termination. You should not operate on the machine until the command completes.
When you are ready to put the node back into service, use kubectl uncordon, which will make the node schedulable again.
! http://kubernetes.io/images/docs/kubectl_drain.svg
Examples:
# Drain node "foo", even if there are pods not managed by a ReplicationController, ReplicaSet, Job, DaemonSet or
StatefulSet on it.
$ kubectl drain foo --force
# As above, but abort if there are pods not managed by a ReplicationController, ReplicaSet, Job, DaemonSet or
StatefulSet, and use a grace period of 15 minutes.
$ kubectl drain foo --grace-period=900
Options:
--delete-local-data=false: Continue even if there are pods using emptyDir (local data that will be deleted when
the node is drained).
--dry-run=false: If true, only print the object that would be sent, without sending it.
--force=false: Continue even if there are pods not managed by a ReplicationController, ReplicaSet, Job, DaemonSet
or StatefulSet.
--grace-period=-1: Period of time in seconds given to each pod to terminate gracefully. If negative, the default
value specified in the pod will be used.
--ignore-daemonsets=false: Ignore DaemonSet-managed pods.
-l, --selector='': Selector (label query) to filter on
--timeout=0s: The length of time to wait before giving up, zero means infinite
Usage:
kubectl drain NODE [options]
Use "kubectl options" for a list of global command-line options (applies to all commands).
8.4 - 镜像仓库
8.4.1 - 配置私有镜像仓库
1. 镜像仓库的基本操作
1.1. 登录镜像仓库
docker login -u <username> -p <password> <registry-addr>
1.2. 拉取镜像
docker pull registry.xxx.com/dev/nginx:latest
1.3. 推送镜像
docker push registry.xxx.com/dev/nginx:latest
1.4. 重命名镜像
docker tag <old-image> <new-image>
2. docker.xxx.com镜像仓库
使用docker.xxx.com镜像仓库。
2.1. 所有节点配置insecure-registries
#cat /etc/docker/daemon.json
{
"data-root": "/data/docker",
"debug": false,
"insecure-registries": [
...
"docker.xxx.com:8080"
],
...
}
2.2. 所有节点配置/var/lib/kubelet/config.json
具体参考:configuring-nodes-to-authenticate-to-a-private-registry
- 在某个节点登录docker.xxx.com:8080镜像仓库,会更新 $HOME/.docker/config.json
- 检查$HOME/.docker/config.json是否有该镜像仓库的auth信息。
#cat ~/.docker/config.json
{
"auths": {
"docker.xxx.com:8080": {
"auth": "<此处为凭证信息>"
}
},
"HttpHeaders": {
"User-Agent": "Docker-Client/18.09.9 (linux)"
}
}
- 将`$HOME/.docker/config.json`拷贝到所有的Node节点上的`/var/lib/kubelet/config.json`。
# 获取所有节点的IP
nodes=$(kubectl get nodes -o jsonpath='{range .items[*].status.addresses[?(@.type=="ExternalIP")]}{.address} {end}')
# 拷贝到所有节点
for n in $nodes; do scp ~/.docker/config.json root@$n:/var/lib/kubelet/config.json; done
2.3. 创建docker.xxx.com镜像的pod
指定镜像为:docker.xxx.com:8080/public/2048:latest
完整pod.yaml
apiVersion: apps/v1beta2
kind: Deployment
metadata:
annotations:
deployment.kubernetes.io/revision: "1"
generation: 1
labels:
k8s-app: dockeroa-hub
qcloud-app: dockeroa-hub
name: dockeroa-hub
namespace: test
spec:
progressDeadlineSeconds: 600
replicas: 3
revisionHistoryLimit: 10
selector:
matchLabels:
k8s-app: dockeroa-hub
qcloud-app: dockeroa-hub
strategy:
rollingUpdate:
maxSurge: 25%
maxUnavailable: 25%
type: RollingUpdate
template:
metadata:
labels:
k8s-app: dockeroa-hub
qcloud-app: dockeroa-hub
spec:
containers:
- image: docker.xxx.com:8080/public/2048:latest
imagePullPolicy: Always
name: game
resources:
limits:
cpu: 500m
memory: 1Gi
requests:
cpu: 250m
memory: 256Mi
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
dnsPolicy: ClusterFirst
restartPolicy: Always
nodeName: 192.168.1.1
schedulerName: default-scheduler
securityContext: {}
terminationGracePeriodSeconds: 30
查看pod状态
#kgpoowide -n game
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
docker-oa-757bbbddb5-h6j7m 1/1 Running 0 14m 192.168.2.51 192.168.1.1 <none> <none>
docker-oa-757bbbddb5-jp5dw 1/1 Running 0 14m 192.168.1.32 192.168.1.2 <none> <none>
docker-oa-757bbbddb5-nlw9f 1/1 Running 0 14m 192.168.0.43 192.168.1.3 <none> <none>
8.4.2 - 拉取私有镜像
本文介绍通过pod指定 ImagePullSecrets来拉取私有镜像仓库的镜像
1. 创建secret
secret是namespace级别的,创建时候需要指定namespace。
kubectl create secret docker-registry <name> --docker-server=DOCKER_REGISTRY_SERVER --docker-username=DOCKER_USER --docker-password=DOCKER_PASSWORD -n <NAMESPACE>
2. 添加ImagePullSecrets到serviceAccount
可以通过将ImagePullSecrets到serviceAccount的方式来自动给pod添加imagePullSecrets参数值。
serviceAccount同样是namespace级别,只对该namespace生效。
#kubectl get secrets -n dev
NAME TYPE DATA AGE
docker.xxxx.com kubernetes.io/dockerconfigjson 1 6h23m
将ImagePullSecrets添加到serviceAccount对象中。
默认serviceAccount对象如下
#kubectl get serviceaccount default -n dev -o yaml
apiVersion: v1
kind: ServiceAccount
metadata:
creationTimestamp: "2020-02-27T03:30:38Z"
name: default
namespace: dev
resourceVersion: "11651567"
selfLink: /api/v1/namespaces/dev/serviceaccounts/default
uid: 85bcdd31-5911-11ea-9429-6c92bf3b7c33
secrets:
- name: default-token-s7wfn
编辑或修改serviceAccount内容,增加imagePullSecrets字段。
imagePullSecrets:
- name: docker.xxxx.com
kubectl edit serviceaccount default -n dev
修改后内容为:
apiVersion: v1
kind: ServiceAccount
metadata:
creationTimestamp: "2020-02-27T03:30:38Z"
name: default
namespace: dev
resourceVersion: "11651567"
selfLink: /api/v1/namespaces/dev/serviceaccounts/default
uid: 85bcdd31-5911-11ea-9429-6c92bf3b7c33
secrets:
- name: default-token-s7wfn
imagePullSecrets:
- name: docker.xxxx.com
3. 创建带有imagePullSecrets的pod
如果已经执行了第二步操作,添加ImagePullSecrets到serviceAccount,则无需在pod中指定imagePullSecrets参数,默认会自动添加。
如果没有添加ImagePullSecrets到serviceAccount,则在pod中指定imagePullSecrets参数引用创建的镜像仓库的secret。
spec:
imagePullSecrets:
- name: docker.xxxx.com
4. 说明
由于secret和serviceaccount对象是对namespace级别生效,因此不同的namespace需要再次创建和更新这两个对象。该场景适合不同用户具有独立的镜像仓库的密码,可以通过该方式创建不同的镜像密码使用的secret来拉取不同的镜像部署。
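此外,也可以不使用kubectl edit,直接通过kubectl patch为serviceAccount追加imagePullSecrets(secret名称以实际创建的为准):
kubectl patch serviceaccount default -n dev -p '{"imagePullSecrets": [{"name": "docker.xxxx.com"}]}'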
9 - 开发指南
9.1 - client-go的使用及源码分析
1. client-go简介
1.1 client-go说明
client-go是一个调用kubernetes集群资源对象API的客户端,即通过client-go实现对kubernetes集群中资源对象(包括deployment、service、ingress、replicaSet、pod、namespace、node等)的增删改查等操作。大部分对kubernetes进行前置API封装的二次开发都通过client-go这个第三方包来实现。
client-go官方文档:https://github.com/kubernetes/client-go
1.2 示例代码
git clone https://github.com/huweihuang/client-go.git
cd client-go
#保证本地HOME目录有配置kubernetes集群的配置文件
go run client-go.go
package main
import (
"flag"
"fmt"
"os"
"path/filepath"
"time"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"k8s.io/client-go/kubernetes"
"k8s.io/client-go/tools/clientcmd"
)
func main() {
var kubeconfig *string
if home := homeDir(); home != "" {
kubeconfig = flag.String("kubeconfig", filepath.Join(home, ".kube", "config"), "(optional) absolute path to the kubeconfig file")
} else {
kubeconfig = flag.String("kubeconfig", "", "absolute path to the kubeconfig file")
}
flag.Parse()
// uses the current context in kubeconfig
config, err := clientcmd.BuildConfigFromFlags("", *kubeconfig)
if err != nil {
panic(err.Error())
}
// creates the clientset
clientset, err := kubernetes.NewForConfig(config)
if err != nil {
panic(err.Error())
}
for {
pods, err := clientset.CoreV1().Pods("").List(metav1.ListOptions{})
if err != nil {
panic(err.Error())
}
fmt.Printf("There are %d pods in the cluster\n", len(pods.Items))
time.Sleep(10 * time.Second)
}
}
func homeDir() string {
if h := os.Getenv("HOME"); h != "" {
return h
}
return os.Getenv("USERPROFILE") // windows
}
1.3 运行结果
➜ go run client-go.go
There are 9 pods in the cluster
There are 7 pods in the cluster
There are 7 pods in the cluster
There are 7 pods in the cluster
There are 7 pods in the cluster
2. client-go源码分析
client-go源码:https://github.com/kubernetes/client-go
client-go源码目录结构
- The
kubernetespackage contains the clientset to access Kubernetes API. - The
discoverypackage is used to discover APIs supported by a Kubernetes API server. - The
dynamicpackage contains a dynamic client that can perform generic operations on arbitrary Kubernetes API objects. - The
transportpackage is used to set up auth and start a connection. - The
tools/cachepackage is useful for writing controllers.
2.1 kubeconfig
kubeconfig = flag.String("kubeconfig", filepath.Join(home, ".kube", "config"), "(optional) absolute path to the kubeconfig file")
获取kubernetes配置文件kubeconfig的绝对路径。一般路径为$HOME/.kube/config。该文件主要用来配置本地连接的kubernetes集群。
config内容如下:
apiVersion: v1
clusters:
- cluster:
server: http://<kube-master-ip>:8080
name: k8s
contexts:
- context:
cluster: k8s
namespace: default
user: ""
name: default
current-context: default
kind: Config
preferences: {}
users: []
2.2 rest.config
通过参数(master的url或者kubeconfig路径)和BuildConfigFromFlags方法来获取rest.Config对象,一般是通过参数kubeconfig的路径。
config, err := clientcmd.BuildConfigFromFlags("", *kubeconfig)
BuildConfigFromFlags函数源码
k8s.io/client-go/tools/clientcmd/client_config.go
// BuildConfigFromFlags is a helper function that builds configs from a master
// url or a kubeconfig filepath. These are passed in as command line flags for cluster
// components. Warnings should reflect this usage. If neither masterUrl or kubeconfigPath
// are passed in we fallback to inClusterConfig. If inClusterConfig fails, we fallback
// to the default config.
func BuildConfigFromFlags(masterUrl, kubeconfigPath string) (*restclient.Config, error) {
if kubeconfigPath == "" && masterUrl == "" {
glog.Warningf("Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.")
kubeconfig, err := restclient.InClusterConfig()
if err == nil {
return kubeconfig, nil
}
glog.Warning("error creating inClusterConfig, falling back to default config: ", err)
}
return NewNonInteractiveDeferredLoadingClientConfig(
&ClientConfigLoadingRules{ExplicitPath: kubeconfigPath},
&ConfigOverrides{ClusterInfo: clientcmdapi.Cluster{Server: masterUrl}}).ClientConfig()
}
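从上面的源码可以看出,当masterUrl和kubeconfigPath都为空时,会回退到InClusterConfig。对于部署在kubernetes集群内的程序,也可以不依赖kubeconfig文件,直接使用rest.InClusterConfig构造配置,以下为一个简单示意:
package main

import (
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
)

func main() {
    // 集群内运行时,通过Pod中挂载的ServiceAccount信息构造rest.Config
    config, err := rest.InClusterConfig()
    if err != nil {
        panic(err.Error())
    }
    // 后续用法与使用kubeconfig时一致
    clientset, err := kubernetes.NewForConfig(config)
    if err != nil {
        panic(err.Error())
    }
    _ = clientset
}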
2.3 clientset
通过*rest.Config参数和NewForConfig方法来获取clientset对象,clientset是多个client的集合,每个client可能包含不同版本的方法调用。
clientset, err := kubernetes.NewForConfig(config)
2.3.1 NewForConfig
NewForConfig函数就是初始化clientset中的每个client。
k8s.io/client-go/kubernetes/clientset.go
// NewForConfig creates a new Clientset for the given config.
func NewForConfig(c *rest.Config) (*Clientset, error) {
configShallowCopy := *c
...
var cs Clientset
cs.appsV1beta1, err = appsv1beta1.NewForConfig(&configShallowCopy)
...
cs.coreV1, err = corev1.NewForConfig(&configShallowCopy)
...
}
2.3.2 clientset的结构体
k8s.io/client-go/kubernetes/clientset.go
// Clientset contains the clients for groups. Each group has exactly one
// version included in a Clientset.
type Clientset struct {
*discovery.DiscoveryClient
admissionregistrationV1alpha1 *admissionregistrationv1alpha1.AdmissionregistrationV1alpha1Client
appsV1beta1 *appsv1beta1.AppsV1beta1Client
appsV1beta2 *appsv1beta2.AppsV1beta2Client
authenticationV1 *authenticationv1.AuthenticationV1Client
authenticationV1beta1 *authenticationv1beta1.AuthenticationV1beta1Client
authorizationV1 *authorizationv1.AuthorizationV1Client
authorizationV1beta1 *authorizationv1beta1.AuthorizationV1beta1Client
autoscalingV1 *autoscalingv1.AutoscalingV1Client
autoscalingV2beta1 *autoscalingv2beta1.AutoscalingV2beta1Client
batchV1 *batchv1.BatchV1Client
batchV1beta1 *batchv1beta1.BatchV1beta1Client
batchV2alpha1 *batchv2alpha1.BatchV2alpha1Client
certificatesV1beta1 *certificatesv1beta1.CertificatesV1beta1Client
coreV1 *corev1.CoreV1Client
extensionsV1beta1 *extensionsv1beta1.ExtensionsV1beta1Client
networkingV1 *networkingv1.NetworkingV1Client
policyV1beta1 *policyv1beta1.PolicyV1beta1Client
rbacV1 *rbacv1.RbacV1Client
rbacV1beta1 *rbacv1beta1.RbacV1beta1Client
rbacV1alpha1 *rbacv1alpha1.RbacV1alpha1Client
schedulingV1alpha1 *schedulingv1alpha1.SchedulingV1alpha1Client
settingsV1alpha1 *settingsv1alpha1.SettingsV1alpha1Client
storageV1beta1 *storagev1beta1.StorageV1beta1Client
storageV1 *storagev1.StorageV1Client
}
2.3.3 clientset.Interface
clientset实现了以下的Interface,因此可以通过调用以下方法获得具体的client。例如:
pods, err := clientset.CoreV1().Pods("").List(metav1.ListOptions{})
clientset的方法集接口
k8s.io/client-go/kubernetes/clientset.go
type Interface interface {
Discovery() discovery.DiscoveryInterface
AdmissionregistrationV1alpha1() admissionregistrationv1alpha1.AdmissionregistrationV1alpha1Interface
// Deprecated: please explicitly pick a version if possible.
Admissionregistration() admissionregistrationv1alpha1.AdmissionregistrationV1alpha1Interface
AppsV1beta1() appsv1beta1.AppsV1beta1Interface
AppsV1beta2() appsv1beta2.AppsV1beta2Interface
// Deprecated: please explicitly pick a version if possible.
Apps() appsv1beta2.AppsV1beta2Interface
AuthenticationV1() authenticationv1.AuthenticationV1Interface
// Deprecated: please explicitly pick a version if possible.
Authentication() authenticationv1.AuthenticationV1Interface
AuthenticationV1beta1() authenticationv1beta1.AuthenticationV1beta1Interface
AuthorizationV1() authorizationv1.AuthorizationV1Interface
// Deprecated: please explicitly pick a version if possible.
Authorization() authorizationv1.AuthorizationV1Interface
AuthorizationV1beta1() authorizationv1beta1.AuthorizationV1beta1Interface
AutoscalingV1() autoscalingv1.AutoscalingV1Interface
// Deprecated: please explicitly pick a version if possible.
Autoscaling() autoscalingv1.AutoscalingV1Interface
AutoscalingV2beta1() autoscalingv2beta1.AutoscalingV2beta1Interface
BatchV1() batchv1.BatchV1Interface
// Deprecated: please explicitly pick a version if possible.
Batch() batchv1.BatchV1Interface
BatchV1beta1() batchv1beta1.BatchV1beta1Interface
BatchV2alpha1() batchv2alpha1.BatchV2alpha1Interface
CertificatesV1beta1() certificatesv1beta1.CertificatesV1beta1Interface
// Deprecated: please explicitly pick a version if possible.
Certificates() certificatesv1beta1.CertificatesV1beta1Interface
CoreV1() corev1.CoreV1Interface
// Deprecated: please explicitly pick a version if possible.
Core() corev1.CoreV1Interface
ExtensionsV1beta1() extensionsv1beta1.ExtensionsV1beta1Interface
// Deprecated: please explicitly pick a version if possible.
Extensions() extensionsv1beta1.ExtensionsV1beta1Interface
NetworkingV1() networkingv1.NetworkingV1Interface
// Deprecated: please explicitly pick a version if possible.
Networking() networkingv1.NetworkingV1Interface
PolicyV1beta1() policyv1beta1.PolicyV1beta1Interface
// Deprecated: please explicitly pick a version if possible.
Policy() policyv1beta1.PolicyV1beta1Interface
RbacV1() rbacv1.RbacV1Interface
// Deprecated: please explicitly pick a version if possible.
Rbac() rbacv1.RbacV1Interface
RbacV1beta1() rbacv1beta1.RbacV1beta1Interface
RbacV1alpha1() rbacv1alpha1.RbacV1alpha1Interface
SchedulingV1alpha1() schedulingv1alpha1.SchedulingV1alpha1Interface
// Deprecated: please explicitly pick a version if possible.
Scheduling() schedulingv1alpha1.SchedulingV1alpha1Interface
SettingsV1alpha1() settingsv1alpha1.SettingsV1alpha1Interface
// Deprecated: please explicitly pick a version if possible.
Settings() settingsv1alpha1.SettingsV1alpha1Interface
StorageV1beta1() storagev1beta1.StorageV1beta1Interface
StorageV1() storagev1.StorageV1Interface
// Deprecated: please explicitly pick a version if possible.
Storage() storagev1.StorageV1Interface
}
2.4 CoreV1Client
我们以clientset中的CoreV1Client为例做分析。
通过传入的配置信息rest.Config初始化CoreV1Client对象。
k8s.io/client-go/kubernetes/clientset.go
cs.coreV1, err = corev1.NewForConfig(&configShallowCopy)
2.4.1 corev1.NewForConfig
k8s.io/client-go/kubernetes/typed/core/v1/core_client.go
// NewForConfig creates a new CoreV1Client for the given config.
func NewForConfig(c *rest.Config) (*CoreV1Client, error) {
config := *c
if err := setConfigDefaults(&config); err != nil {
return nil, err
}
client, err := rest.RESTClientFor(&config)
if err != nil {
return nil, err
}
return &CoreV1Client{client}, nil
}
corev1.NewForConfig方法本质上是调用了rest.RESTClientFor(&config)方法创建RESTClient对象,即CoreV1Client本质上是对RESTClient对象的封装。
2.4.2 CoreV1Client结构体
以下是CoreV1Client结构体的定义:
k8s.io/client-go/kubernetes/typed/core/v1/core_client.go
// CoreV1Client is used to interact with features provided by the group.
type CoreV1Client struct {
restClient rest.Interface
}
CoreV1Client实现了CoreV1Interface的接口,即以下方法,从而对kubernetes的资源对象进行增删改查的操作。
k8s.io/client-go/kubernetes/typed/core/v1/core_client.go
//CoreV1Client的方法
func (c *CoreV1Client) ComponentStatuses() ComponentStatusInterface {...}
//ConfigMaps
func (c *CoreV1Client) ConfigMaps(namespace string) ConfigMapInterface {...}
//Endpoints
func (c *CoreV1Client) Endpoints(namespace string) EndpointsInterface {...}
func (c *CoreV1Client) Events(namespace string) EventInterface {...}
func (c *CoreV1Client) LimitRanges(namespace string) LimitRangeInterface {...}
//Namespaces
func (c *CoreV1Client) Namespaces() NamespaceInterface {...}
//Nodes
func (c *CoreV1Client) Nodes() NodeInterface {...}
func (c *CoreV1Client) PersistentVolumes() PersistentVolumeInterface {...}
func (c *CoreV1Client) PersistentVolumeClaims(namespace string) PersistentVolumeClaimInterface {...}
//Pods
func (c *CoreV1Client) Pods(namespace string) PodInterface {...}
func (c *CoreV1Client) PodTemplates(namespace string) PodTemplateInterface {...}
//ReplicationControllers
func (c *CoreV1Client) ReplicationControllers(namespace string) ReplicationControllerInterface {...}
func (c *CoreV1Client) ResourceQuotas(namespace string) ResourceQuotaInterface {...}
func (c *CoreV1Client) Secrets(namespace string) SecretInterface {...}
//Services
func (c *CoreV1Client) Services(namespace string) ServiceInterface {...}
func (c *CoreV1Client) ServiceAccounts(namespace string) ServiceAccountInterface {...}
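例如,基于上文创建的clientset,可以通过这些方法直接查询Node、Namespace等资源(以下仅为示意,方法签名与本文引用的client-go版本一致):
//仅为示意:使用CoreV1Client查询Node和Namespace列表,clientset为前文创建的对象
nodes, err := clientset.CoreV1().Nodes().List(metav1.ListOptions{})
if err != nil {
    panic(err.Error())
}
fmt.Printf("There are %d nodes in the cluster\n", len(nodes.Items))

namespaces, err := clientset.CoreV1().Namespaces().List(metav1.ListOptions{})
if err != nil {
    panic(err.Error())
}
fmt.Printf("There are %d namespaces in the cluster\n", len(namespaces.Items))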
2.4.3 CoreV1Interface
k8s.io/client-go/kubernetes/typed/core/v1/core_client.go
type CoreV1Interface interface {
RESTClient() rest.Interface
ComponentStatusesGetter
ConfigMapsGetter
EndpointsGetter
EventsGetter
LimitRangesGetter
NamespacesGetter
NodesGetter
PersistentVolumesGetter
PersistentVolumeClaimsGetter
PodsGetter
PodTemplatesGetter
ReplicationControllersGetter
ResourceQuotasGetter
SecretsGetter
ServicesGetter
ServiceAccountsGetter
}
CoreV1Interface中包含了各种kubernetes对象的调用接口,例如PodsGetter是对kubernetes中pod对象增删改查操作的接口。ServicesGetter是对service对象的操作的接口。
2.4.4 PodsGetter
以下我们以PodsGetter接口为例分析CoreV1Client对pod对象的增删改查接口调用。
示例中的代码如下:
pods, err := clientset.CoreV1().Pods("").List(metav1.ListOptions{})
CoreV1().Pods()
k8s.io/client-go/kubernetes/typed/core/v1/core_client.go
func (c *CoreV1Client) Pods(namespace string) PodInterface {
return newPods(c, namespace)
}
newPods()
k8s.io/client-go/kubernetes/typed/core/v1/pod.go
// newPods returns a Pods
func newPods(c *CoreV1Client, namespace string) *pods {
return &pods{
client: c.RESTClient(),
ns: namespace,
}
}
CoreV1().Pods()方法实际上是调用了newPods()方法,创建了一个pods对象,pods对象中持有rest.Interface类型的client,即最终的实现本质是RESTClient的HTTP调用。
k8s.io/client-go/kubernetes/typed/core/v1/pod.go
// pods implements PodInterface
type pods struct {
client rest.Interface
ns string
}
pods对象实现了PodInterface接口。PodInterface定义了pods对象的增删改查等方法。
k8s.io/client-go/kubernetes/typed/core/v1/pod.go
// PodInterface has methods to work with Pod resources.
type PodInterface interface {
Create(*v1.Pod) (*v1.Pod, error)
Update(*v1.Pod) (*v1.Pod, error)
UpdateStatus(*v1.Pod) (*v1.Pod, error)
Delete(name string, options *meta_v1.DeleteOptions) error
DeleteCollection(options *meta_v1.DeleteOptions, listOptions meta_v1.ListOptions) error
Get(name string, options meta_v1.GetOptions) (*v1.Pod, error)
List(opts meta_v1.ListOptions) (*v1.PodList, error)
Watch(opts meta_v1.ListOptions) (watch.Interface, error)
Patch(name string, pt types.PatchType, data []byte, subresources ...string) (result *v1.Pod, err error)
PodExpansion
}
PodsGetter
PodsGetter接口定义了Pods方法,用于返回PodInterface。
k8s.io/client-go/kubernetes/typed/core/v1/pod.go
// PodsGetter has a method to return a PodInterface.
// A group's client should implement this interface.
type PodsGetter interface {
Pods(namespace string) PodInterface
}
Pods().List()
pods.List()方法通过RESTClient的HTTP调用来实现对kubernetes的pod资源的获取。
k8s.io/client-go/kubernetes/typed/core/v1/pod.go
// List takes label and field selectors, and returns the list of Pods that match those selectors.
func (c *pods) List(opts meta_v1.ListOptions) (result *v1.PodList, err error) {
result = &v1.PodList{}
err = c.client.Get().
Namespace(c.ns).
Resource("pods").
VersionedParams(&opts, scheme.ParameterCodec).
Do().
Into(result)
return
}
以上分析了clientset.CoreV1().Pods("").List(metav1.ListOptions{})对pod资源获取的过程,最终是调用RESTClient的方法实现。
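为了说明这一点,下面给出一个与clientset.CoreV1().Pods("").List()等价的、直接使用RESTClient的调用示意(podList为示例变量名,需要额外导入corev1 "k8s.io/api/core/v1"和"k8s.io/client-go/kubernetes/scheme"包):
//仅为示意:直接通过RESTClient发起与Pods("").List()等价的GET请求
podList := &corev1.PodList{}
err := clientset.CoreV1().RESTClient().
    Get().
    Resource("pods").                                              //对应URL路径 /api/v1/pods
    VersionedParams(&metav1.ListOptions{}, scheme.ParameterCodec). //序列化List参数
    Do().
    Into(podList) //将响应反序列化为PodList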
2.5 RESTClient
以下分析RESTClient的创建过程及作用。
RESTClient对象的创建同样是依赖传入的config信息。
k8s.io/client-go/kubernetes/typed/core/v1/core_client.go
client, err := rest.RESTClientFor(&config)
2.5.1 rest.RESTClientFor
k8s.io/client-go/rest/config.go
// RESTClientFor returns a RESTClient that satisfies the requested attributes on a client Config
// object. Note that a RESTClient may require fields that are optional when initializing a Client.
// A RESTClient created by this method is generic - it expects to operate on an API that follows
// the Kubernetes conventions, but may not be the Kubernetes API.
func RESTClientFor(config *Config) (*RESTClient, error) {
...
qps := config.QPS
...
burst := config.Burst
...
baseURL, versionedAPIPath, err := defaultServerUrlFor(config)
...
transport, err := TransportFor(config)
...
var httpClient *http.Client
if transport != http.DefaultTransport {
httpClient = &http.Client{Transport: transport}
if config.Timeout > 0 {
httpClient.Timeout = config.Timeout
}
}
return NewRESTClient(baseURL, versionedAPIPath, config.ContentConfig, qps, burst, config.RateLimiter, httpClient)
}
RESTClientFor函数调用了NewRESTClient的初始化函数。
2.5.2 NewRESTClient
k8s.io/client-go/rest/client.go
// NewRESTClient creates a new RESTClient. This client performs generic REST functions
// such as Get, Put, Post, and Delete on specified paths. Codec controls encoding and
// decoding of responses from the server.
func NewRESTClient(baseURL *url.URL, versionedAPIPath string, config ContentConfig, maxQPS float32, maxBurst int, rateLimiter flowcontrol.RateLimiter, client *http.Client) (*RESTClient, error) {
base := *baseURL
...
serializers, err := createSerializers(config)
...
return &RESTClient{
base: &base,
versionedAPIPath: versionedAPIPath,
contentConfig: config,
serializers: *serializers,
createBackoffMgr: readExpBackoffConfig,
Throttle: throttle,
Client: client,
}, nil
}
2.5.3 RESTClient结构体
以下介绍RESTClient的结构体定义,RESTClient结构体中包含了http.Client,即本质上RESTClient就是一个http.Client的封装实现。
k8s.io/client-go/rest/client.go
// RESTClient imposes common Kubernetes API conventions on a set of resource paths.
// The baseURL is expected to point to an HTTP or HTTPS path that is the parent
// of one or more resources. The server should return a decodable API resource
// object, or an api.Status object which contains information about the reason for
// any failure.
//
// Most consumers should use client.New() to get a Kubernetes API client.
type RESTClient struct {
// base is the root URL for all invocations of the client
base *url.URL
// versionedAPIPath is a path segment connecting the base URL to the resource root
versionedAPIPath string
// contentConfig is the information used to communicate with the server.
contentConfig ContentConfig
// serializers contain all serializers for underlying content type.
serializers Serializers
// creates BackoffManager that is passed to requests.
createBackoffMgr func() BackoffManager
// TODO extract this into a wrapper interface via the RESTClient interface in kubectl.
Throttle flowcontrol.RateLimiter
// Set specific behavior of the client. If not set http.DefaultClient will be used.
Client *http.Client
}
2.5.4 RESTClient.Interface
RESTClient实现了以下的接口方法:
k8s.io/client-go/rest/client.go
// Interface captures the set of operations for generically interacting with Kubernetes REST apis.
type Interface interface {
GetRateLimiter() flowcontrol.RateLimiter
Verb(verb string) *Request
Post() *Request
Put() *Request
Patch(pt types.PatchType) *Request
Get() *Request
Delete() *Request
APIVersion() schema.GroupVersion
}
在调用HTTP方法(Post(),Put(),Get(),Delete() )时,实际上调用了Verb(verb string)函数。
k8s.io/client-go/rest/client.go
// Verb begins a request with a verb (GET, POST, PUT, DELETE).
//
// Example usage of RESTClient's request building interface:
// c, err := NewRESTClient(...)
// if err != nil { ... }
// resp, err := c.Verb("GET").
// Path("pods").
// SelectorParam("labels", "area=staging").
// Timeout(10*time.Second).
// Do()
// if err != nil { ... }
// list, ok := resp.(*api.PodList)
//
func (c *RESTClient) Verb(verb string) *Request {
backoff := c.createBackoffMgr()
if c.Client == nil {
return NewRequest(nil, verb, c.base, c.versionedAPIPath, c.contentConfig, c.serializers, backoff, c.Throttle)
}
return NewRequest(c.Client, verb, c.base, c.versionedAPIPath, c.contentConfig, c.serializers, backoff, c.Throttle)
}
Verb函数调用了NewRequest方法,最后调用Do()方法实现一个HTTP请求获取Result。
2.6 总结
client-go对kubernetes资源对象的调用,需要先获取kubernetes的配置信息,即$HOME/.kube/config。
整个调用的过程如下:
kubeconfig→rest.config→clientset→具体的client(CoreV1Client)→具体的资源对象(pod)→RESTClient→http.Client→HTTP请求的发送及响应
通过clientset中不同的client和client中不同资源对象的方法实现对kubernetes中资源对象的增删改查等操作,常用的client有CoreV1Client、AppsV1beta1Client、ExtensionsV1beta1Client等。
3. client-go对k8s资源的调用
创建clientset
//获取kubeconfig
kubeconfig = flag.String("kubeconfig", filepath.Join(home, ".kube", "config"), "(optional) absolute path to the kubeconfig file")
//创建config
config, err := clientcmd.BuildConfigFromFlags("", *kubeconfig)
//创建clientset
clientset, err := kubernetes.NewForConfig(config)
//具体的资源调用见以下例子
3.1 deployment
//声明deployment对象
var deployment *v1beta1.Deployment
//构造deployment对象
//创建deployment
deployment, err := clientset.AppsV1beta1().Deployments(<namespace>).Create(<deployment>)
//更新deployment
deployment, err := clientset.AppsV1beta1().Deployments(<namespace>).Update(<deployment>)
//删除deployment
err := clientset.AppsV1beta1().Deployments(<namespace>).Delete(<deployment.Name>, &meta_v1.DeleteOptions{})
//查询deployment
deployment, err := clientset.AppsV1beta1().Deployments(<namespace>).Get(<deployment.Name>, meta_v1.GetOptions{})
//列出deployment
deploymentList, err := clientset.AppsV1beta1().Deployments(<namespace>).List(&meta_v1.ListOptions{})
//watch deployment
watchInterface, err := clientset.AppsV1beta1().Deployments(<namespace>).Watch(&meta_v1.ListOptions{})
3.2 service
//声明service对象
var service *v1.Service
//构造service对象
//创建service
service, err := clientset.CoreV1().Services(<namespace>).Create(<service>)
//更新service
service, err := clientset.CoreV1().Services(<namespace>).Update(<service>)
//删除service
err := clientset.CoreV1().Services(<namespace>).Delete(<service.Name>, &meta_v1.DeleteOptions{})
//查询service
service, err := clientset.CoreV1().Services(<namespace>).Get(<service.Name>, meta_v1.GetOptions{})
//列出service
serviceList, err := clientset.CoreV1().Services(<namespace>).List(&meta_v1.ListOptions{})
//watch service
watchInterface, err := clientset.CoreV1().Services(<namespace>).Watch(&meta_v1.ListOptions{})
3.3 ingress
//声明ingress对象
var ingress *v1beta1.Ingress
//构造ingress对象
//创建ingress
ingress, err := clientset.ExtensionsV1beta1().Ingresses(<namespace>).Create(<ingress>)
//更新ingress
ingress, err := clientset.ExtensionsV1beta1().Ingresses(<namespace>).Update(<ingress>)
//删除ingress
err := clientset.ExtensionsV1beta1().Ingresses(<namespace>).Delete(<ingress.Name>, &meta_v1.DeleteOptions{})
//查询ingress
ingress, err := clientset.ExtensionsV1beta1().Ingresses(<namespace>).Get(<ingress.Name>, meta_v1.GetOptions{})
//列出ingress
ingressList, err := clientset.ExtensionsV1beta1().Ingresses(<namespace>).List(&meta_v1.ListOptions{})
//watch ingress
watchInterface, err := clientset.ExtensionsV1beta1().Ingresses(<namespace>).Watch(&meta_v1.ListOptions{})
3.4 replicaSet
//声明replicaSet对象
var replicaSet *v1beta1.ReplicaSet
//构造replicaSet对象
//创建replicaSet
replicaSet, err := clientset.ExtensionsV1beta1().ReplicaSets(<namespace>).Create(<replicaSet>)
//更新replicaSet
replicaSet, err := clientset.ExtensionsV1beta1().ReplicaSets(<namespace>).Update(<replicaSet>)
//删除replicaSet
err := clientset.ExtensionsV1beta1().ReplicaSets(<namespace>).Delete(<replicaSet.Name>, &meta_v1.DeleteOptions{})
//查询replicaSet
replicaSet, err := clientset.ExtensionsV1beta1().ReplicaSets(<namespace>).Get(<replicaSet.Name>, meta_v1.GetOptions{})
//列出replicaSet
replicaSetList, err := clientset.ExtensionsV1beta1().ReplicaSets(<namespace>).List(&meta_v1.ListOptions{})
//watch replicaSet
watchInterface, err := clientset.ExtensionsV1beta1().ReplicaSets(<namespace>).Watch(&meta_v1.ListOptions{})
新版的kubernetes中一般通过deployment来创建replicaSet,再通过replicaSet来控制pod。
3.5 pod
//声明pod对象
var pod *v1.Pod
//创建pod
pod, err := clientset.CoreV1().Pods(<namespace>).Create(<pod>)
//更新pod
pod, err := clientset.CoreV1().Pods(<namespace>).Update(<pod>)
//删除pod
err := clientset.CoreV1().Pods(<namespace>).Delete(<pod.Name>, &meta_v1.DeleteOptions{})
//查询pod
pod, err := clientset.CoreV1().Pods(<namespace>).Get(<pod.Name>, meta_v1.GetOptions{})
//列出pod
podList, err := clientset.CoreV1().Pods(<namespace>).List(&meta_v1.ListOptions{})
//watch pod
watchInterface, err := clientset.CoreV1().Pods(<namespace>).Watch(&meta_v1.ListOptions{})
3.6 statefulset
//声明statefulset对象
var statefulset *v1.StatefulSet
//创建statefulset
statefulset, err := clientset.AppsV1().StatefulSets(<namespace>).Create(<statefulset>)
//更新statefulset
statefulset, err := clientset.AppsV1().StatefulSets(<namespace>).Update(<statefulset>)
//删除statefulset
err := clientset.AppsV1().StatefulSets(<namespace>).Delete(<statefulset.Name>, &meta_v1.DeleteOptions{})
//查询statefulset
statefulset, err := clientset.AppsV1().StatefulSets(<namespace>).Get(<statefulset.Name>, meta_v1.GetOptions{})
//列出statefulset
statefulsetList, err := clientset.AppsV1().StatefulSets(<namespace>).List(&meta_v1.ListOptions{})
//watch statefulset
watchInterface, err := clientset.AppsV1().StatefulSets(<namespace>).Watch(&meta_v1.ListOptions{})
通过以上对kubernetes的资源对象的操作函数可以看出,每个资源对象都有增删改查等方法,基本调用逻辑类似。一般二次开发只需要创建deployment、service、ingress三个资源对象即可,pod对象由deployment包含的replicaSet来控制创建和删除。函数调用的入参一般只有NAMESPACE和kubernetesObject两个参数,部分操作有Options的参数。在创建前,需要对资源对象构造数据,可以理解为编辑一个资源对象的yaml文件,然后通过kubectl create -f xxx.yaml来创建对象。
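以下给出一个按照先构造对象、再调用Create的流程创建Deployment的最小示意(基于本文使用的AppsV1beta1接口,clientset为前文创建的对象,namespace、名称、镜像、副本数等均为示例值,需要导入appsv1beta1 "k8s.io/api/apps/v1beta1"、corev1 "k8s.io/api/core/v1"等包):
//仅为示意:构造一个最简单的nginx Deployment对象并调用Create
replicas := int32(2)
deployment := &appsv1beta1.Deployment{
    ObjectMeta: metav1.ObjectMeta{Name: "nginx-demo"},
    Spec: appsv1beta1.DeploymentSpec{
        Replicas: &replicas,
        Template: corev1.PodTemplateSpec{
            ObjectMeta: metav1.ObjectMeta{Labels: map[string]string{"app": "nginx-demo"}},
            Spec: corev1.PodSpec{
                Containers: []corev1.Container{
                    {Name: "nginx", Image: "nginx:1.17", Ports: []corev1.ContainerPort{{ContainerPort: 80}}},
                },
            },
        },
    },
}
//创建deployment,等价于kubectl create -f deployment.yaml
result, err := clientset.AppsV1beta1().Deployments("default").Create(deployment)
if err != nil {
    panic(err.Error())
}
fmt.Printf("created deployment %s\n", result.Name)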
9.2 - operator开发
9.2.1 - kubebuilder的使用
1. kubebuilder
1.1. 安装kubebuilder
# download kubebuilder and install locally.
curl -L -o kubebuilder https://go.kubebuilder.io/dl/latest/$(go env GOOS)/$(go env GOARCH)
chmod +x kubebuilder && mv kubebuilder /usr/local/bin/
1.2. kubebuilder命令
Development kit for building Kubernetes extensions and tools.
Provides libraries and tools to create new projects, APIs and controllers.
Includes tools for packaging artifacts into an installer container.
Typical project lifecycle:
- initialize a project:
kubebuilder init --domain example.com --license apache2 --owner "The Kubernetes authors"
- create one or more a new resource APIs and add your code to them:
kubebuilder create api --group <group> --version <version> --kind <Kind>
Create resource will prompt the user for if it should scaffold the Resource and / or Controller. To only
scaffold a Controller for an existing Resource, select "n" for Resource. To only define
the schema for a Resource without writing a Controller, select "n" for Controller.
After the scaffold is written, api will run make on the project.
Usage:
kubebuilder [command]
Available Commands:
create Scaffold a Kubernetes API or webhook.
edit This command will edit the project configuration
help Help about any command
init Initialize a new project
version Print the kubebuilder version
Flags:
-h, --help help for kubebuilder
Use "kubebuilder [command] --help" for more information about a command.
2. 操作步骤
2.1. 初始化
mkdir $GOPATH/src/github.com/huweihuang/operator-example
cd $GOPATH/src/github.com/huweihuang/operator-example
go mod init github.com/huweihuang/operator-example
2.2. 创建项目
# kubebuilder init --domain github.com --license apache2 --owner "Hu Weihuang"
Writing scaffold for you to edit...
Get controller runtime:
$ go get sigs.k8s.io/controller-runtime@v0.5.0
Update go.mod:
$ go mod tidy
Running make:
$ make
go: creating new go.mod: module tmp
go: finding sigs.k8s.io v0.2.5
go: finding sigs.k8s.io/controller-tools/cmd v0.2.5
go: finding sigs.k8s.io/controller-tools/cmd/controller-gen v0.2.5
/Users/weihuanghu/go/bin/controller-gen object:headerFile="hack/boilerplate.go.txt" paths="./..."
go fmt ./...
go vet ./...
go build -o bin/manager main.go
Next: define a resource with:
$ kubebuilder create api
查看生成文件:
./
├── Dockerfile
├── Makefile
├── PROJECT
├── bin
│ └── manager
├── config
│ ├── certmanager
│ │ ├── certificate.yaml
│ │ ├── kustomization.yaml
│ │ └── kustomizeconfig.yaml
│ ├── default
│ │ ├── kustomization.yaml
│ │ ├── manager_auth_proxy_patch.yaml
│ │ ├── manager_webhook_patch.yaml
│ │ └── webhookcainjection_patch.yaml
│ ├── manager
│ │ ├── kustomization.yaml
│ │ └── manager.yaml
│ ├── prometheus
│ │ ├── kustomization.yaml
│ │ └── monitor.yaml
│ ├── rbac
│ │ ├── auth_proxy_client_clusterrole.yaml
│ │ ├── auth_proxy_role.yaml
│ │ ├── auth_proxy_role_binding.yaml
│ │ ├── auth_proxy_service.yaml
│ │ ├── kustomization.yaml
│ │ ├── leader_election_role.yaml
│ │ ├── leader_election_role_binding.yaml
│ │ └── role_binding.yaml
│ └── webhook
│ ├── kustomization.yaml
│ ├── kustomizeconfig.yaml
│ └── service.yaml
├── go.mod
├── go.sum
├── hack
│ └── boilerplate.go.txt
└── main.go
2.3. 创建API
# kubebuilder create api --group webapp --version v1 --kind Guestbook
Create Resource [y/n]
y
Create Controller [y/n]
y
Writing scaffold for you to edit...
api/v1/guestbook_types.go
controllers/guestbook_controller.go
Running make:
$ make
go: creating new go.mod: module tmp
go: finding sigs.k8s.io/controller-tools/cmd v0.2.5
go: finding sigs.k8s.io/controller-tools/cmd/controller-gen v0.2.5
go: finding sigs.k8s.io v0.2.5
/Users/weihuanghu/go/bin/controller-gen object:headerFile="hack/boilerplate.go.txt" paths="./..."
go fmt ./...
go vet ./...
go build -o bin/manager main.go
查看创建文件
api
└── v1
├── groupversion_info.go
├── guestbook_types.go
└── zz_generated.deepcopy.go
controllers
├── guestbook_controller.go
└── suite_test.go
查看api/v1/guestbook_types.go
// GuestbookSpec defines the desired state of Guestbook
type GuestbookSpec struct {
// INSERT ADDITIONAL SPEC FIELDS - desired state of cluster
// Important: Run "make" to regenerate code after modifying this file
// Quantity of instances
// +kubebuilder:validation:Minimum=1
// +kubebuilder:validation:Maximum=10
Size int32 `json:"size"`
// Name of the ConfigMap for GuestbookSpec's configuration
// +kubebuilder:validation:MaxLength=15
// +kubebuilder:validation:MinLength=1
ConfigMapName string `json:"configMapName"`
// +kubebuilder:validation:Enum=Phone;Address;Name
Type string `json:"alias,omitempty"`
}
// GuestbookStatus defines the observed state of Guestbook
type GuestbookStatus struct {
// INSERT ADDITIONAL STATUS FIELD - define observed state of cluster
// Important: Run "make" to regenerate code after modifying this file
// PodName of the active Guestbook node.
Active string `json:"active"`
// PodNames of the standby Guestbook nodes.
Standby []string `json:"standby"`
}
// +kubebuilder:object:root=true
// +kubebuilder:subresource:status
// +kubebuilder:resource:scope=Cluster
// Guestbook is the Schema for the guestbooks API
type Guestbook struct {
metav1.TypeMeta `json:",inline"`
metav1.ObjectMeta `json:"metadata,omitempty"`
Spec GuestbookSpec `json:"spec,omitempty"`
Status GuestbookStatus `json:"status,omitempty"`
}
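基于上面生成的类型定义,可以在controllers/guestbook_controller.go中补充调谐逻辑。以下是一个假设的Reconcile实现示意(基于controller-runtime v0.5的接口,需要导入context、sigs.k8s.io/controller-runtime及其pkg/client等包,webappv1假设指向本项目的api/v1包,实际字段以生成的代码为准):
// Reconcile 读取Guestbook对象并根据Spec做处理(示意)
func (r *GuestbookReconciler) Reconcile(req ctrl.Request) (ctrl.Result, error) {
    ctx := context.Background()
    log := r.Log.WithValues("guestbook", req.NamespacedName)

    // 根据namespace/name获取Guestbook对象
    var guestbook webappv1.Guestbook
    if err := r.Get(ctx, req.NamespacedName, &guestbook); err != nil {
        // 对象已被删除时忽略NotFound错误,避免反复重试
        return ctrl.Result{}, client.IgnoreNotFound(err)
    }

    log.Info("reconciling guestbook", "size", guestbook.Spec.Size, "configMapName", guestbook.Spec.ConfigMapName)
    // TODO: 根据Spec中的字段创建或更新对应的工作负载,并回写Status

    return ctrl.Result{}, nil
}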
3. troubleshooting
3.1. controller-gen: No such file or directory
➜ operator-example kubebuilder init --domain github.com --license apache2 --owner "Hu Weihuang"
Writing scaffold for you to edit...
Get controller runtime:
$ go get sigs.k8s.io/controller-runtime@v0.5.0
Update go.mod:
$ go mod tidy
Running make:
$ make
go: creating new go.mod: module tmp
go: finding sigs.k8s.io v0.2.5
go: finding sigs.k8s.io/controller-tools/cmd v0.2.5
go: finding sigs.k8s.io/controller-tools/cmd/controller-gen v0.2.5
/Users/weihuanghu/go:/Users/weihuanghu/k8spath/bin/controller-gen object:headerFile="hack/boilerplate.go.txt" paths="./..."
/bin/sh: /Users/weihuanghu/go:/Users/weihuanghu/k8spath/bin/controller-gen: No such file or directory
make: *** [generate] Error 127
2020/04/13 14:34:47 failed to initialize project: exit status 2
由于GOPATH环境变量中配置了多个目录(以冒号分隔),Makefile拼接出的controller-gen路径无效。将GOPATH设置为当前项目所在的单个目录即可解决:
export GOPATH="/path/to/gopath"
9.3 - CSI插件开发
9.3.1 - csi-provisioner源码分析
本文主要分析csi-provisioner的源码。关于如何开发一个Dynamic Provisioner,具体可参考nfs-client-provisioner的源码分析。
1. Dynamic Provisioner
1.1. Provisioner Interface
开发Dynamic Provisioner需要实现Provisioner接口,该接口有两个方法,分别是:
- Provision:创建存储资源,并且返回一个PV对象。
- Delete:移除对应的存储资源,但并没有删除PV对象。
1.2. 开发provisioner的步骤
- 写一个provisioner实现Provisioner接口(包含Provision和Delete的方法)。
- 通过该provisioner构建ProvisionController。
- 执行ProvisionController的Run方法。
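按照上述三个步骤,一个最小的自定义provisioner骨架大致如下(基于external-storage的controller库,myProvisioner及provisioner名称等均为示例,Provision和Delete中的具体逻辑需要根据后端存储自行实现):
package main

import (
    "errors"

    "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/util/wait"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"

    "github.com/kubernetes-incubator/external-storage/lib/controller"
)

type myProvisioner struct{}

// 编译期检查myProvisioner是否实现了Provisioner接口
var _ controller.Provisioner = &myProvisioner{}

// Provision 在后端存储上创建卷,并根据VolumeOptions构造PV对象返回
func (p *myProvisioner) Provision(options controller.VolumeOptions) (*v1.PersistentVolume, error) {
    // TODO: 创建后端存储资源并构造对应的PV
    return nil, errors.New("not implemented")
}

// Delete 清理Provision创建的后端存储资源(不负责删除PV对象本身)
func (p *myProvisioner) Delete(volume *v1.PersistentVolume) error {
    // TODO: 删除后端存储上的卷
    return nil
}

func main() {
    // 集群内运行时直接使用InClusterConfig
    config, err := rest.InClusterConfig()
    if err != nil {
        panic(err)
    }
    clientset, err := kubernetes.NewForConfig(config)
    if err != nil {
        panic(err)
    }
    serverVersion, err := clientset.Discovery().ServerVersion()
    if err != nil {
        panic(err)
    }

    // 1.实现Provisioner接口 2.构建ProvisionController 3.执行Run
    pc := controller.NewProvisionController(clientset, "example.com/my-provisioner", &myProvisioner{}, serverVersion.GitVersion)
    pc.Run(wait.NeverStop)
}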
2. CSI Provisioner
CSI Provisioner的源码可参考:https://github.com/kubernetes-csi/external-provisioner。
2.1. Main 函数
2.1.1. 读取环境变量
源码如下:
var (
provisioner = flag.String("provisioner", "", "Name of the provisioner. The provisioner will only provision volumes for claims that request a StorageClass with a provisioner field set equal to this name.")
master = flag.String("master", "", "Master URL to build a client config from. Either this or kubeconfig needs to be set if the provisioner is being run out of cluster.")
kubeconfig = flag.String("kubeconfig", "", "Absolute path to the kubeconfig file. Either this or master needs to be set if the provisioner is being run out of cluster.")
csiEndpoint = flag.String("csi-address", "/run/csi/socket", "The gRPC endpoint for Target CSI Volume")
connectionTimeout = flag.Duration("connection-timeout", 10*time.Second, "Timeout for waiting for CSI driver socket.")
volumeNamePrefix = flag.String("volume-name-prefix", "pvc", "Prefix to apply to the name of a created volume")
volumeNameUUIDLength = flag.Int("volume-name-uuid-length", -1, "Truncates generated UUID of a created volume to this length. Defaults behavior is to NOT truncate.")
showVersion = flag.Bool("version", false, "Show version.")
provisionController *controller.ProvisionController
version = "unknown"
)
func init() {
var config *rest.Config
var err error
flag.Parse()
flag.Set("logtostderr", "true")
if *showVersion {
fmt.Println(os.Args[0], version)
os.Exit(0)
}
glog.Infof("Version: %s", version)
...
}
通过init函数解析相关参数,其中provisioner参数指明为PVC提供PV的provisioner的名字,需要和StorageClass对象中的provisioner字段一致。
2.1.2. 获取clientset对象
源码如下:
// get the KUBECONFIG from env if specified (useful for local/debug cluster)
kubeconfigEnv := os.Getenv("KUBECONFIG")
if kubeconfigEnv != "" {
glog.Infof("Found KUBECONFIG environment variable set, using that..")
kubeconfig = &kubeconfigEnv
}
if *master != "" || *kubeconfig != "" {
glog.Infof("Either master or kubeconfig specified. building kube config from that..")
config, err = clientcmd.BuildConfigFromFlags(*master, *kubeconfig)
} else {
glog.Infof("Building kube configs for running in cluster...")
config, err = rest.InClusterConfig()
}
if err != nil {
glog.Fatalf("Failed to create config: %v", err)
}
clientset, err := kubernetes.NewForConfig(config)
if err != nil {
glog.Fatalf("Failed to create client: %v", err)
}
// snapclientset.NewForConfig creates a new Clientset for VolumesnapshotV1alpha1Client
snapClient, err := snapclientset.NewForConfig(config)
if err != nil {
glog.Fatalf("Failed to create snapshot client: %v", err)
}
csiAPIClient, err := csiclientset.NewForConfig(config)
if err != nil {
glog.Fatalf("Failed to create CSI API client: %v", err)
}
通过读取对应的k8s的配置,创建clientset对象,用来执行k8s对应的API,其中主要包括对PV和PVC等对象的创建删除等操作。
2.1.3. k8s版本校验
// The controller needs to know what the server version is because out-of-tree
// provisioners aren't officially supported until 1.5
serverVersion, err := clientset.Discovery().ServerVersion()
if err != nil {
glog.Fatalf("Error getting server version: %v", err)
}
获取了k8s的版本信息,因为provisioners的功能在k8s 1.5及以上版本才支持。
2.1.4. 连接 csi socket
// Generate a unique ID for this provisioner
timeStamp := time.Now().UnixNano() / int64(time.Millisecond)
identity := strconv.FormatInt(timeStamp, 10) + "-" + strconv.Itoa(rand.Intn(10000)) + "-" + *provisioner
// Provisioner will stay in Init until driver opens csi socket, once it's done
// controller will exit this loop and proceed normally.
socketDown := true
grpcClient := &grpc.ClientConn{}
for socketDown {
grpcClient, err = ctrl.Connect(*csiEndpoint, *connectionTimeout)
if err == nil {
socketDown = false
continue
}
time.Sleep(10 * time.Second)
}
Provisioner会停留在初始化状态,直到csi socket连接成功才正常运行。如果连接失败,会暂停10秒后重试,其中涉及以下2个参数:
- csiEndpoint:CSI Volume的gRPC地址,默认为/run/csi/socket。
- connectionTimeout:连接CSI driver socket的超时时间,默认为10秒。
2.1.5. 构造csi-Provisioner对象
// Create the provisioner: it implements the Provisioner interface expected by
// the controller
csiProvisioner := ctrl.NewCSIProvisioner(clientset, csiAPIClient, *csiEndpoint, *connectionTimeout, identity, *volumeNamePrefix, *volumeNameUUIDLength, grpcClient, snapClient)
provisionController = controller.NewProvisionController(
clientset,
*provisioner,
csiProvisioner,
serverVersion.GitVersion,
)
通过参数clientset, csiAPIClient, csiEndpoint, connectionTimeout, identity, volumeNamePrefix, volumeNameUUIDLength, grpcClient, snapClient构造csi-Provisioner对象。
通过csiProvisioner构造ProvisionController对象。
2.1.6. 运行ProvisionController
func main() {
provisionController.Run(wait.NeverStop)
}
ProvisionController实现了具体的PV和PVC的相关逻辑,Run方法以常驻进程的方式运行。
2.2. Provision和Delete方法
2.2.1. Provision方法
csiProvisioner的Provision方法具体源码参考:https://github.com/kubernetes-csi/external-provisioner/blob/master/pkg/controller/controller.go#L336
Provision方法用来创建存储资源,并且返回一个PV对象。其中入参是VolumeOptions,用来指定PV对象的相关属性。
1、构造PV相关属性
pvName, err := makeVolumeName(p.volumeNamePrefix, fmt.Sprintf("%s", options.PVC.ObjectMeta.UID), p.volumeNameUUIDLength)
if err != nil {
return nil, err
}
2、构造CSIPersistentVolumeSource相关属性
driverState, err := checkDriverState(p.grpcClient, p.timeout, needSnapshotSupport)
if err != nil {
return nil, err
}
...
// Resolve controller publish, node stage, node publish secret references
controllerPublishSecretRef, err := getSecretReference(controllerPublishSecretNameKey, controllerPublishSecretNamespaceKey, options.Parameters, pvName, options.PVC)
if err != nil {
return nil, err
}
nodeStageSecretRef, err := getSecretReference(nodeStageSecretNameKey, nodeStageSecretNamespaceKey, options.Parameters, pvName, options.PVC)
if err != nil {
return nil, err
}
nodePublishSecretRef, err := getSecretReference(nodePublishSecretNameKey, nodePublishSecretNamespaceKey, options.Parameters, pvName, options.PVC)
if err != nil {
return nil, err
}
...
volumeAttributes := map[string]string{provisionerIDKey: p.identity}
for k, v := range rep.Volume.Attributes {
volumeAttributes[k] = v
}
...
fsType := ""
for k, v := range options.Parameters {
switch strings.ToLower(k) {
case "fstype":
fsType = v
}
}
if len(fsType) == 0 {
fsType = defaultFSType
}
3、创建CSI CreateVolumeRequest
// Create a CSI CreateVolumeRequest and Response
req := csi.CreateVolumeRequest{
Name: pvName,
Parameters: options.Parameters,
VolumeCapabilities: volumeCaps,
CapacityRange: &csi.CapacityRange{
RequiredBytes: int64(volSizeBytes),
},
}
...
glog.V(5).Infof("CreateVolumeRequest %+v", req)
rep := &csi.CreateVolumeResponse{}
...
opts := wait.Backoff{Duration: backoffDuration, Factor: backoffFactor, Steps: backoffSteps}
err = wait.ExponentialBackoff(opts, func() (bool, error) {
ctx, cancel := context.WithTimeout(context.Background(), p.timeout)
defer cancel()
rep, err = p.csiClient.CreateVolume(ctx, &req)
if err == nil {
// CreateVolume has finished successfully
return true, nil
}
if status, ok := status.FromError(err); ok {
if status.Code() == codes.DeadlineExceeded {
// CreateVolume timed out, give it another chance to complete
glog.Warningf("CreateVolume timeout: %s has expired, operation will be retried", p.timeout.String())
return false, nil
}
}
// CreateVolume failed , no reason to retry, bailing from ExponentialBackoff
return false, err
})
if err != nil {
return nil, err
}
if rep.Volume != nil {
glog.V(3).Infof("create volume rep: %+v", *rep.Volume)
}
respCap := rep.GetVolume().GetCapacityBytes()
if respCap < volSizeBytes {
capErr := fmt.Errorf("created volume capacity %v less than requested capacity %v", respCap, volSizeBytes)
delReq := &csi.DeleteVolumeRequest{
VolumeId: rep.GetVolume().GetId(),
}
delReq.ControllerDeleteSecrets = provisionerCredentials
ctx, cancel := context.WithTimeout(context.Background(), p.timeout)
defer cancel()
_, err := p.csiClient.DeleteVolume(ctx, delReq)
if err != nil {
capErr = fmt.Errorf("%v. Cleanup of volume %s failed, volume is orphaned: %v", capErr, pvName, err)
}
return nil, capErr
}
Provision方法的核心功能是调用p.csiClient.CreateVolume(ctx, &req)。
4、构造PV对象
pv := &v1.PersistentVolume{
ObjectMeta: metav1.ObjectMeta{
Name: pvName,
},
Spec: v1.PersistentVolumeSpec{
PersistentVolumeReclaimPolicy: options.PersistentVolumeReclaimPolicy,
AccessModes: options.PVC.Spec.AccessModes,
Capacity: v1.ResourceList{
v1.ResourceName(v1.ResourceStorage): bytesToGiQuantity(respCap),
},
// TODO wait for CSI VolumeSource API
PersistentVolumeSource: v1.PersistentVolumeSource{
CSI: &v1.CSIPersistentVolumeSource{
Driver: driverState.driverName,
VolumeHandle: p.volumeIdToHandle(rep.Volume.Id),
FSType: fsType,
VolumeAttributes: volumeAttributes,
ControllerPublishSecretRef: controllerPublishSecretRef,
NodeStageSecretRef: nodeStageSecretRef,
NodePublishSecretRef: nodePublishSecretRef,
},
},
},
}
if driverState.capabilities.Has(PluginCapability_ACCESSIBILITY_CONSTRAINTS) {
pv.Spec.NodeAffinity = GenerateVolumeNodeAffinity(rep.Volume.AccessibleTopology)
}
glog.Infof("successfully created PV %+v", pv.Spec.PersistentVolumeSource)
return pv, nil
Provision方法只是通过VolumeOptions参数来构建PV对象,并没有执行具体PV的创建或删除的操作。
不同类型的Provisioner,区别一般在于PersistentVolumeSource的类型和参数不同,例如csi-provisioner对应的PersistentVolumeSource为CSI,并且需要传入CSI相关的参数:Driver、VolumeHandle、FSType、VolumeAttributes、ControllerPublishSecretRef、NodeStageSecretRef、NodePublishSecretRef。
2.2.2. Delete方法
csiProvisioner的delete方法具体源码参考:https://github.com/kubernetes-csi/external-provisioner/blob/master/pkg/controller/controller.go#L606
func (p *csiProvisioner) Delete(volume *v1.PersistentVolume) error {
if volume == nil || volume.Spec.CSI == nil {
return fmt.Errorf("invalid CSI PV")
}
volumeId := p.volumeHandleToId(volume.Spec.CSI.VolumeHandle)
_, err := checkDriverState(p.grpcClient, p.timeout, false)
if err != nil {
return err
}
req := csi.DeleteVolumeRequest{
VolumeId: volumeId,
}
// get secrets if StorageClass specifies it
storageClassName := volume.Spec.StorageClassName
if len(storageClassName) != 0 {
if storageClass, err := p.client.StorageV1().StorageClasses().Get(storageClassName, metav1.GetOptions{}); err == nil {
// Resolve provision secret credentials.
// No PVC is provided when resolving provision/delete secret names, since the PVC may or may not exist at delete time.
provisionerSecretRef, err := getSecretReference(provisionerSecretNameKey, provisionerSecretNamespaceKey, storageClass.Parameters, volume.Name, nil)
if err != nil {
return err
}
credentials, err := getCredentials(p.client, provisionerSecretRef)
if err != nil {
return err
}
req.ControllerDeleteSecrets = credentials
}
}
ctx, cancel := context.WithTimeout(context.Background(), p.timeout)
defer cancel()
_, err = p.csiClient.DeleteVolume(ctx, &req)
return err
}
Delete方法主要是调用了p.csiClient.DeleteVolume(ctx, &req)方法。
2.3. 总结
csi provisioner实现了Provisioner接口,其中包含Provision和Delete两个方法:
- Provision:调用csiClient.CreateVolume方法,同时构造并返回PV对象。
- Delete:调用csiClient.DeleteVolume方法。
csi provisioner的核心方法都调用了csi-client相关方法。
3. csi-client
csi client的相关代码参考:https://github.com/container-storage-interface/spec/blob/master/lib/go/csi/v0/csi.pb.go
3.1. 构造csi-client
3.1.1. 构造grpcClient
// Provisioner will stay in Init until driver opens csi socket, once it's done
// controller will exit this loop and proceed normally.
socketDown := true
grpcClient := &grpc.ClientConn{}
for socketDown {
grpcClient, err = ctrl.Connect(*csiEndpoint, *connectionTimeout)
if err == nil {
socketDown = false
continue
}
time.Sleep(10 * time.Second)
}
通过连接csi socket,连接成功才构造可用的grpcClient。
3.1.2. 构造csi-client
通过grpcClient构造csi-client。
// Create the provisioner: it implements the Provisioner interface expected by
// the controller
csiProvisioner := ctrl.NewCSIProvisioner(clientset, csiAPIClient, *csiEndpoint, *connectionTimeout, identity, *volumeNamePrefix, *volumeNameUUIDLength, grpcClient, snapClient)
NewCSIProvisioner
// NewCSIProvisioner creates new CSI provisioner
func NewCSIProvisioner(client kubernetes.Interface,
csiAPIClient csiclientset.Interface,
csiEndpoint string,
connectionTimeout time.Duration,
identity string,
volumeNamePrefix string,
volumeNameUUIDLength int,
grpcClient *grpc.ClientConn,
snapshotClient snapclientset.Interface) controller.Provisioner {
csiClient := csi.NewControllerClient(grpcClient)
provisioner := &csiProvisioner{
client: client,
grpcClient: grpcClient,
csiClient: csiClient,
csiAPIClient: csiAPIClient,
snapshotClient: snapshotClient,
timeout: connectionTimeout,
identity: identity,
volumeNamePrefix: volumeNamePrefix,
volumeNameUUIDLength: volumeNameUUIDLength,
}
return provisioner
}
csiClient := csi.NewControllerClient(grpcClient)
...
type controllerClient struct {
cc *grpc.ClientConn
}
func NewControllerClient(cc *grpc.ClientConn) ControllerClient {
return &controllerClient{cc}
}
3.2. csiClient.CreateVolume
csi provisioner中调用csiClient.CreateVolume的代码如下:
opts := wait.Backoff{Duration: backoffDuration, Factor: backoffFactor, Steps: backoffSteps}
err = wait.ExponentialBackoff(opts, func() (bool, error) {
ctx, cancel := context.WithTimeout(context.Background(), p.timeout)
defer cancel()
rep, err = p.csiClient.CreateVolume(ctx, &req)
if err == nil {
// CreateVolume has finished successfully
return true, nil
}
if status, ok := status.FromError(err); ok {
if status.Code() == codes.DeadlineExceeded {
// CreateVolume timed out, give it another chance to complete
glog.Warningf("CreateVolume timeout: %s has expired, operation will be retried", p.timeout.String())
return false, nil
}
}
// CreateVolume failed , no reason to retry, bailing from ExponentialBackoff
return false, err
})
CreateVolumeRequest的构造:
// Create a CSI CreateVolumeRequest and Response
req := csi.CreateVolumeRequest{
Name: pvName,
Parameters: options.Parameters,
VolumeCapabilities: volumeCaps,
CapacityRange: &csi.CapacityRange{
RequiredBytes: int64(volSizeBytes),
},
}
...
req.VolumeContentSource = volumeContentSource
...
req.AccessibilityRequirements = requirements
...
req.ControllerCreateSecrets = provisionerCredentials
具体的Create实现方法如下:
其中csiClient是一个接口类型,具体代码可参考controllerClient.CreateVolume的实现:
func (c *controllerClient) CreateVolume(ctx context.Context, in *CreateVolumeRequest, opts ...grpc.CallOption) (*CreateVolumeResponse, error) {
out := new(CreateVolumeResponse)
err := grpc.Invoke(ctx, "/csi.v0.Controller/CreateVolume", in, out, c.cc, opts...)
if err != nil {
return nil, err
}
return out, nil
}
3.3. csiClient.DeleteVolume
csi provisioner中调用csiClient.DeleteVolume的代码如下:
func (p *csiProvisioner) Delete(volume *v1.PersistentVolume) error {
...
req := csi.DeleteVolumeRequest{
VolumeId: volumeId,
}
// get secrets if StorageClass specifies it
...
ctx, cancel := context.WithTimeout(context.Background(), p.timeout)
defer cancel()
_, err = p.csiClient.DeleteVolume(ctx, &req)
return err
}
DeleteVolumeRequest的构造:
req := csi.DeleteVolumeRequest{
VolumeId: volumeId,
}
...
req.ControllerDeleteSecrets = credentials
将构造的DeleteVolumeRequest传给DeleteVolume方法。
具体的Delete实现方法如下:
具体代码参考:controllerClient.DeleteVolume
func (c *controllerClient) DeleteVolume(ctx context.Context, in *DeleteVolumeRequest, opts ...grpc.CallOption) (*DeleteVolumeResponse, error) {
out := new(DeleteVolumeResponse)
err := grpc.Invoke(ctx, "/csi.v0.Controller/DeleteVolume", in, out, c.cc, opts...)
if err != nil {
return nil, err
}
return out, nil
}
4. ProvisionController.Run
自定义的provisioner实现了Provisioner接口的Provision和Delete方法,这两个方法主要对后端存储做创建和删除操作,并没有对PV对象进行创建和删除操作。
PV对象的相关操作具体由ProvisionController中的provisionClaimOperation和deleteVolumeOperation具体执行,同时调用了具体provisioner的Provision和Delete两个方法来对存储数据做处理。
func main() {
provisionController.Run(wait.NeverStop)
}
这块代码逻辑可参考:nfs-client-provisioner 源码分析
9.3.2 - nfs-client-provisioner源码分析
如果要开发一个Dynamic Provisioner,需要使用到the helper library。
1. Dynamic Provisioner
1.1. Provisioner Interface
开发Dynamic Provisioner需要实现Provisioner接口,该接口有两个方法,分别是:
- Provision:创建存储资源,并且返回一个PV对象。
- Delete:移除对应的存储资源,但并没有删除PV对象。
Provisioner 接口源码如下:
// Provisioner is an interface that creates templates for PersistentVolumes
// and can create the volume as a new resource in the infrastructure provider.
// It can also remove the volume it created from the underlying storage
// provider.
type Provisioner interface {
// Provision creates a volume i.e. the storage asset and returns a PV object
// for the volume
Provision(VolumeOptions) (*v1.PersistentVolume, error)
// Delete removes the storage asset that was created by Provision backing the
// given PV. Does not delete the PV object itself.
//
// May return IgnoredError to indicate that the call has been ignored and no
// action taken.
Delete(*v1.PersistentVolume) error
}
1.2. VolumeOptions
Provisioner接口的Provision方法的入参是一个VolumeOptions对象。VolumeOptions对象包含了创建PV对象所需要的信息,例如:PV的回收策略,PV的名字,PV所对应的PVC对象以及PVC的StorageClass对象使用的参数等。
VolumeOptions 源码如下:
// VolumeOptions contains option information about a volume
// https://github.com/kubernetes/kubernetes/blob/release-1.4/pkg/volume/plugins.go
type VolumeOptions struct {
// Reclamation policy for a persistent volume
PersistentVolumeReclaimPolicy v1.PersistentVolumeReclaimPolicy
// PV.Name of the appropriate PersistentVolume. Used to generate cloud
// volume name.
PVName string
// PV mount options. Not validated - mount of the PVs will simply fail if one is invalid.
MountOptions []string
// PVC is reference to the claim that lead to provisioning of a new PV.
// Provisioners *must* create a PV that would be matched by this PVC,
// i.e. with required capacity, accessMode, labels matching PVC.Selector and
// so on.
PVC *v1.PersistentVolumeClaim
// Volume provisioning parameters from StorageClass
Parameters map[string]string
// Node selected by the scheduler for the volume.
SelectedNode *v1.Node
// Topology constraint parameter from StorageClass
AllowedTopologies []v1.TopologySelectorTerm
}
1.3. ProvisionController
ProvisionController是一个给PVC提供PV的控制器,具体执行Provisioner接口的Provision和Delete的方法的所有逻辑。
1.4. 开发provisioner的步骤
- 写一个provisioner实现Provisioner接口(包含Provision和Delete的方法)。
- 通过该provisioner构建ProvisionController。
- 执行ProvisionController的Run方法。
2. NFS Client Provisioner
nfs-client-provisioner是一个automatic provisioner,使用NFS作为后端存储,自动为PVC创建对应的PV,本身不提供NFS存储,需要外部先有一套NFS存储服务。
- PV以${namespace}-${pvcName}-${pvName}的命名格式提供(在NFS服务器上)
- PV回收的时候以archived-${namespace}-${pvcName}-${pvName}的命名格式存档(在NFS服务器上)
以下通过nfs-client-provisioner的源码分析来说明开发自定义provisioner整个过程。nfs-client-provisioner的主要代码都在provisioner.go的文件中。
nfs-client-provisioner源码地址:https://github.com/kubernetes-incubator/external-storage/tree/master/nfs-client
2.1. Main函数
2.1.1. 读取环境变量
源码如下:
func main() {
flag.Parse()
flag.Set("logtostderr", "true")
server := os.Getenv("NFS_SERVER")
if server == "" {
glog.Fatal("NFS_SERVER not set")
}
path := os.Getenv("NFS_PATH")
if path == "" {
glog.Fatal("NFS_PATH not set")
}
provisionerName := os.Getenv(provisionerNameKey)
if provisionerName == "" {
glog.Fatalf("environment variable %s is not set! Please set it.", provisionerNameKey)
}
...
}
main函数先获取NFS_SERVER、NFS_PATH、PROVISIONER_NAME三个环境变量的值,因此在部署nfs-client-provisioner的时候,需要将这三个环境变量的值传入。
- NFS_SERVER:NFS服务端的IP地址。
- NFS_PATH:NFS服务端设置的共享目录。
- PROVISIONER_NAME:provisioner的名字,需要和StorageClass对象中的provisioner字段一致。
例如StorageClass对象的yaml文件如下:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: managed-nfs-storage
provisioner: fuseim.pri/ifs # or choose another name, must match deployment's env PROVISIONER_NAME'
parameters:
archiveOnDelete: "false" # When set to "false" your PVs will not be archived by the provisioner upon deletion of the PVC.
2.1.2. 获取clientset对象
源码如下:
// Create an InClusterConfig and use it to create a client for the controller
// to use to communicate with Kubernetes
config, err := rest.InClusterConfig()
if err != nil {
glog.Fatalf("Failed to create config: %v", err)
}
clientset, err := kubernetes.NewForConfig(config)
if err != nil {
glog.Fatalf("Failed to create client: %v", err)
}
通过读取对应的k8s的配置,创建clientset对象,用来执行k8s对应的API,其中主要包括对PV和PVC等对象的创建删除等操作。
2.1.3. 构造nfsProvisioner对象
源码如下:
// The controller needs to know what the server version is because out-of-tree
// provisioners aren't officially supported until 1.5
serverVersion, err := clientset.Discovery().ServerVersion()
if err != nil {
glog.Fatalf("Error getting server version: %v", err)
}
clientNFSProvisioner := &nfsProvisioner{
client: clientset,
server: server,
path: path,
}
通过clientset、server、path等值构造nfsProvisioner对象,同时还获取了k8s的版本信息,因为provisioners的功能在k8s 1.5及以上版本才支持。
nfsProvisioner类型定义如下:
type nfsProvisioner struct {
client kubernetes.Interface
server string
path string
}
var _ controller.Provisioner = &nfsProvisioner{}
nfsProvisioner是一个自定义的provisioner,用来实现Provisioner的接口,其中的属性除了server、path这两个NFS相关的参数,还包含了client,主要用来调用k8s的API。
var _ controller.Provisioner = &nfsProvisioner{}
以上用法用来检测nfsProvisioner是否实现了Provisioner的接口。
2.1.4. 构建并运行ProvisionController
源码如下:
// Start the provision controller which will dynamically provision efs NFS
// PVs
pc := controller.NewProvisionController(clientset, provisionerName, clientNFSProvisioner, serverVersion.GitVersion)
pc.Run(wait.NeverStop)
通过nfsProvisioner构造ProvisionController对象并执行Run方法,ProvisionController实现了具体的PV和PVC的相关逻辑,Run方法以常驻进程的方式运行。
2.2. Provision和Delete方法
2.2.1. Provision方法
nfsProvisioner的Provision方法具体源码参考:https://github.com/kubernetes-incubator/external-storage/blob/master/nfs-client/cmd/nfs-client-provisioner/provisioner.go#L56
Provision方法用来创建存储资源,并且返回一个PV对象。其中入参是VolumeOptions,用来指定PV对象的相关属性。
1、构建PV和PVC的名称
func (p *nfsProvisioner) Provision(options controller.VolumeOptions) (*v1.PersistentVolume, error) {
if options.PVC.Spec.Selector != nil {
return nil, fmt.Errorf("claim Selector is not supported")
}
glog.V(4).Infof("nfs provisioner: VolumeOptions %v", options)
pvcNamespace := options.PVC.Namespace
pvcName := options.PVC.Name
pvName := strings.Join([]string{pvcNamespace, pvcName, options.PVName}, "-")
fullPath := filepath.Join(mountPath, pvName)
glog.V(4).Infof("creating path %s", fullPath)
if err := os.MkdirAll(fullPath, 0777); err != nil {
return nil, errors.New("unable to create directory to provision new pv: " + err.Error())
}
os.Chmod(fullPath, 0777)
path := filepath.Join(p.path, pvName)
...
}
通过VolumeOptions的入参,构建PV和PVC的名称,以及创建路径path。
2、构造PV对象
pv := &v1.PersistentVolume{
ObjectMeta: metav1.ObjectMeta{
Name: options.PVName,
},
Spec: v1.PersistentVolumeSpec{
PersistentVolumeReclaimPolicy: options.PersistentVolumeReclaimPolicy,
AccessModes: options.PVC.Spec.AccessModes,
MountOptions: options.MountOptions,
Capacity: v1.ResourceList{
v1.ResourceName(v1.ResourceStorage): options.PVC.Spec.Resources.Requests[v1.ResourceName(v1.ResourceStorage)],
},
PersistentVolumeSource: v1.PersistentVolumeSource{
NFS: &v1.NFSVolumeSource{
Server: p.server,
Path: path,
ReadOnly: false,
},
},
},
}
return pv, nil
综上可以看出,Provision方法只是通过VolumeOptions参数来构建PV对象,并没有执行具体PV的创建或删除的操作。
不同类型的Provisioner,区别一般在于PersistentVolumeSource的类型和参数不同,例如nfs-provisioner对应的PersistentVolumeSource为NFS,并且需要传入NFS相关的参数:Server、Path等。
2.2.2. Delete方法
nfsProvisioner的delete方法具体源码参考:https://github.com/kubernetes-incubator/external-storage/blob/master/nfs-client/cmd/nfs-client-provisioner/provisioner.go#L99
1、获取pvName和path等相关参数
func (p *nfsProvisioner) Delete(volume *v1.PersistentVolume) error {
path := volume.Spec.PersistentVolumeSource.NFS.Path
pvName := filepath.Base(path)
oldPath := filepath.Join(mountPath, pvName)
if _, err := os.Stat(oldPath); os.IsNotExist(err) {
glog.Warningf("path %s does not exist, deletion skipped", oldPath)
return nil
}
...
}
通过path和pvName生成oldPath,其中oldPath是原先NFS服务器上pod对应的数据持久化存储路径。
2、获取archiveOnDelete参数并删除数据
// Get the storage class for this volume.
storageClass, err := p.getClassForVolume(volume)
if err != nil {
return err
}
// Determine if the "archiveOnDelete" parameter exists.
// If it exists and has a falsey value, delete the directory.
// Otherwise, archive it.
archiveOnDelete, exists := storageClass.Parameters["archiveOnDelete"]
if exists {
archiveBool, err := strconv.ParseBool(archiveOnDelete)
if err != nil {
return err
}
if !archiveBool {
return os.RemoveAll(oldPath)
}
}
如果storageClass对象中指定archiveOnDelete参数并且值为false,则会自动删除oldPath下的所有数据,即pod对应的数据持久化存储数据。
archiveOnDelete字面意思为删除时是否存档,false表示不存档,即删除数据,true表示存档,即重命名路径。
3、重命名旧数据路径
archivePath := filepath.Join(mountPath, "archived-"+pvName)
glog.V(4).Infof("archiving path %s to %s", oldPath, archivePath)
return os.Rename(oldPath, archivePath)
如果storageClass对象中没有指定archiveOnDelete参数或者值为true,表明需要删除时存档,即将oldPath重命名,命名格式为oldPath前面增加archived-的前缀。
3. ProvisionController
3.1. ProvisionController结构体
源码具体参考:https://github.com/kubernetes-incubator/external-storage/blob/master/lib/controller/controller.go#L82
ProvisionController是一个给PVC提供PV的控制器,具体执行Provisioner接口的Provision和Delete的方法的所有逻辑。
3.1.1. 入参
// ProvisionController is a controller that provisions PersistentVolumes for
// PersistentVolumeClaims.
type ProvisionController struct {
client kubernetes.Interface
// The name of the provisioner for which this controller dynamically
// provisions volumes. The value of annDynamicallyProvisioned and
// annStorageProvisioner to set & watch for, respectively
provisionerName string
// The provisioner the controller will use to provision and delete volumes.
// Presumably this implementer of Provisioner carries its own
// volume-specific options and such that it needs in order to provision
// volumes.
provisioner Provisioner
// Kubernetes cluster server version:
// * 1.4: storage classes introduced as beta. Technically out-of-tree dynamic
// provisioning is not officially supported, though it works
// * 1.5: storage classes stay in beta. Out-of-tree dynamic provisioning is
// officially supported
// * 1.6: storage classes enter GA
kubeVersion *utilversion.Version
...
}
client、provisionerName、provisioner、kubeVersion等属性作为NewProvisionController的入参。
- client:clientset客户端,用来调用k8s的API。
- provisionerName:provisioner的名字,需要和StorageClass对象中的provisioner字段一致。
- provisioner:具体的provisioner的实现者,本文为nfsProvisioner。
- kubeVersion:k8s的版本信息。
3.1.2. Controller和Informer
type ProvisionController struct {
...
claimInformer cache.SharedInformer
claims cache.Store
claimController cache.Controller
volumeInformer cache.SharedInformer
volumes cache.Store
volumeController cache.Controller
classInformer cache.SharedInformer
classes cache.Store
classController cache.Controller
...
}
ProvisionController结构体中包含了PV、PVC、StorageClass三个对象的Controller、Informer和Store,主要用来执行这三个对象的相关操作。
- Controller:通用的控制框架
- Informer:消息通知器
- Store:通用的对象存储接口
3.1.3. workqueue
type ProvisionController struct {
...
claimQueue workqueue.RateLimitingInterface
volumeQueue workqueue.RateLimitingInterface
...
}
claimQueue和volumeQueue分别是PV和PVC的任务队列。
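workqueue的基本用法可以参考下面的示意(与ProvisionController中claimQueue/volumeQueue的使用方式类似,队列名和key均为示例值):
package main

import (
    "fmt"

    "k8s.io/client-go/util/workqueue"
)

func process(key string) error {
    fmt.Println("processing", key)
    return nil
}

func main() {
    // 创建一个带限速的命名队列,失败的任务会按退避策略重新入队
    queue := workqueue.NewNamedRateLimitingQueue(workqueue.DefaultControllerRateLimiter(), "claims")

    // 事件处理函数中通常将对象的key(namespace/name)加入队列
    queue.Add("default/my-pvc")

    // worker循环中取出任务进行处理(此处只演示处理一个任务)
    key, shutdown := queue.Get()
    if shutdown {
        return
    }
    defer queue.Done(key)

    if err := process(key.(string)); err != nil {
        // 处理失败则限速重试
        queue.AddRateLimited(key)
        return
    }
    // 处理成功则清除该key的重试计数
    queue.Forget(key)
}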
3.1.4. 其他
// Identity of this controller, generated at creation time and not persisted
// across restarts. Useful only for debugging, for seeing the source of
// events. controller.provisioner may have its own, different notion of
// identity which may/may not persist across restarts
id string
component string
eventRecorder record.EventRecorder
resyncPeriod time.Duration
exponentialBackOffOnError bool
threadiness int
createProvisionedPVRetryCount int
createProvisionedPVInterval time.Duration
failedProvisionThreshold, failedDeleteThreshold int
// The port for metrics server to serve on.
metricsPort int32
// The IP address for metrics server to serve on.
metricsAddress string
// The path of metrics endpoint path.
metricsPath string
// Parameters of leaderelection.LeaderElectionConfig.
leaseDuration, renewDeadline, retryPeriod time.Duration
hasRun bool
hasRunLock *sync.Mutex
3.2. NewProvisionController方法
源码地址:https://github.com/kubernetes-incubator/external-storage/blob/master/lib/controller/controller.go#L418
NewProvisionController方法主要用来构造ProvisionController。
3.2.1. 初始化默认值
// NewProvisionController creates a new provision controller using
// the given configuration parameters and with private (non-shared) informers.
func NewProvisionController(
client kubernetes.Interface,
provisionerName string,
provisioner Provisioner,
kubeVersion string,
options ...func(*ProvisionController) error,
) *ProvisionController {
...
controller := &ProvisionController{
client: client,
provisionerName: provisionerName,
provisioner: provisioner,
kubeVersion: utilversion.MustParseSemantic(kubeVersion),
id: id,
component: component,
eventRecorder: eventRecorder,
resyncPeriod: DefaultResyncPeriod,
exponentialBackOffOnError: DefaultExponentialBackOffOnError,
threadiness: DefaultThreadiness,
createProvisionedPVRetryCount: DefaultCreateProvisionedPVRetryCount,
createProvisionedPVInterval: DefaultCreateProvisionedPVInterval,
failedProvisionThreshold: DefaultFailedProvisionThreshold,
failedDeleteThreshold: DefaultFailedDeleteThreshold,
leaseDuration: DefaultLeaseDuration,
renewDeadline: DefaultRenewDeadline,
retryPeriod: DefaultRetryPeriod,
metricsPort: DefaultMetricsPort,
metricsAddress: DefaultMetricsAddress,
metricsPath: DefaultMetricsPath,
hasRun: false,
hasRunLock: &sync.Mutex{},
}
...
}
3.2.2. 初始化任务队列
ratelimiter := workqueue.NewMaxOfRateLimiter(
workqueue.NewItemExponentialFailureRateLimiter(15*time.Second, 1000*time.Second),
&workqueue.BucketRateLimiter{Limiter: rate.NewLimiter(rate.Limit(10), 100)},
)
if !controller.exponentialBackOffOnError {
ratelimiter = workqueue.NewMaxOfRateLimiter(
workqueue.NewItemExponentialFailureRateLimiter(15*time.Second, 15*time.Second),
&workqueue.BucketRateLimiter{Limiter: rate.NewLimiter(rate.Limit(10), 100)},
)
}
controller.claimQueue = workqueue.NewNamedRateLimitingQueue(ratelimiter, "claims")
controller.volumeQueue = workqueue.NewNamedRateLimitingQueue(ratelimiter, "volumes")
3.2.3. ListWatch
// PVC
claimSource := &cache.ListWatch{
ListFunc: func(options metav1.ListOptions) (runtime.Object, error) {
return client.CoreV1().PersistentVolumeClaims(v1.NamespaceAll).List(options)
},
WatchFunc: func(options metav1.ListOptions) (watch.Interface, error) {
return client.CoreV1().PersistentVolumeClaims(v1.NamespaceAll).Watch(options)
},
}
// PV
volumeSource := &cache.ListWatch{
ListFunc: func(options metav1.ListOptions) (runtime.Object, error) {
return client.CoreV1().PersistentVolumes().List(options)
},
WatchFunc: func(options metav1.ListOptions) (watch.Interface, error) {
return client.CoreV1().PersistentVolumes().Watch(options)
},
}
// StorageClass
classSource = &cache.ListWatch{
ListFunc: func(options metav1.ListOptions) (runtime.Object, error) {
return client.StorageV1().StorageClasses().List(options)
},
WatchFunc: func(options metav1.ListOptions) (watch.Interface, error) {
return client.StorageV1().StorageClasses().Watch(options)
},
}
list-watch机制是k8s中用来监听对象变化的核心机制。ListWatch包含ListFunc和WatchFunc两个函数,二者均不能为空,以上代码分别构造了PV、PVC、StorageClass三个对象的ListWatch结构体。该机制的实现在client-go的cache包中,具体参考:https://godoc.org/k8s.io/client-go/tools/cache。
更多ListWatch代码如下:
具体参考:https://github.com/kubernetes-incubator/external-storage/blob/89b0aaf6413b249b37834b124fc314ef7b8ee949/vendor/k8s.io/client-go/tools/cache/listwatch.go#L34
// ListerWatcher is any object that knows how to perform an initial list and start a watch on a resource.
type ListerWatcher interface {
// List should return a list type object; the Items field will be extracted, and the
// ResourceVersion field will be used to start the watch in the right place.
List(options metav1.ListOptions) (runtime.Object, error)
// Watch should begin a watch at the specified version.
Watch(options metav1.ListOptions) (watch.Interface, error)
}
// ListFunc knows how to list resources
type ListFunc func(options metav1.ListOptions) (runtime.Object, error)
// WatchFunc knows how to watch resources
type WatchFunc func(options metav1.ListOptions) (watch.Interface, error)
// ListWatch knows how to list and watch a set of apiserver resources. It satisfies the ListerWatcher interface.
// It is a convenience function for users of NewReflector, etc.
// ListFunc and WatchFunc must not be nil
type ListWatch struct {
ListFunc ListFunc
WatchFunc WatchFunc
// DisableChunking requests no chunking for this list watcher.
DisableChunking bool
}
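除了像上面这样手写ListFunc和WatchFunc,client-go还提供了NewListWatchFromClient辅助函数来构造ListWatch。下面以Pod为例给出示意用法(与本文的provisioner无直接关系,仅作对照):
import (
    v1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/fields"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/cache"
)

// 假设clientset为已初始化的*kubernetes.Clientset
func newPodListWatch(clientset *kubernetes.Clientset) *cache.ListWatch {
    return cache.NewListWatchFromClient(
        clientset.CoreV1().RESTClient(), // 对应资源组的RESTClient
        "pods",                          // 资源名称(复数形式)
        v1.NamespaceAll,                 // 监听所有namespace
        fields.Everything(),             // 不做字段过滤
    )
}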
3.2.4. ResourceEventHandlerFuncs
// PVC
claimHandler := cache.ResourceEventHandlerFuncs{
AddFunc: func(obj interface{}) { controller.enqueueWork(controller.claimQueue, obj) },
UpdateFunc: func(oldObj, newObj interface{}) { controller.enqueueWork(controller.claimQueue, newObj) },
DeleteFunc: func(obj interface{}) { controller.forgetWork(controller.claimQueue, obj) },
}
// PV
volumeHandler := cache.ResourceEventHandlerFuncs{
AddFunc: func(obj interface{}) { controller.enqueueWork(controller.volumeQueue, obj) },
UpdateFunc: func(oldObj, newObj interface{}) { controller.enqueueWork(controller.volumeQueue, newObj) },
DeleteFunc: func(obj interface{}) { controller.forgetWork(controller.volumeQueue, obj) },
}
// StorageClass
classHandler := cache.ResourceEventHandlerFuncs{
// We don't need an actual event handler for StorageClasses,
// but we must pass a non-nil one to cache.NewInformer()
AddFunc: nil,
UpdateFunc: nil,
DeleteFunc: nil,
}
ResourceEventHandlerFuncs是资源事件处理函数的适配器结构体,用于对k8s资源对象的增、删、改事件进行通知处理,它实现了ResourceEventHandler接口。具体代码逻辑在client-go的cache包中。
更多ResourceEventHandlerFuncs代码如下:
// ResourceEventHandler can handle notifications for events that happen to a
// resource. The events are informational only, so you can't return an
// error.
// * OnAdd is called when an object is added.
// * OnUpdate is called when an object is modified. Note that oldObj is the
// last known state of the object-- it is possible that several changes
// were combined together, so you can't use this to see every single
// change. OnUpdate is also called when a re-list happens, and it will
// get called even if nothing changed. This is useful for periodically
// evaluating or syncing something.
// * OnDelete will get the final state of the item if it is known, otherwise
// it will get an object of type DeletedFinalStateUnknown. This can
// happen if the watch is closed and misses the delete event and we don't
// notice the deletion until the subsequent re-list.
type ResourceEventHandler interface {
OnAdd(obj interface{})
OnUpdate(oldObj, newObj interface{})
OnDelete(obj interface{})
}
// ResourceEventHandlerFuncs is an adaptor to let you easily specify as many or
// as few of the notification functions as you want while still implementing
// ResourceEventHandler.
type ResourceEventHandlerFuncs struct {
AddFunc func(obj interface{})
UpdateFunc func(oldObj, newObj interface{})
DeleteFunc func(obj interface{})
}
3.2.5. 构造Store和Controller
1、PVC
if controller.claimInformer != nil {
controller.claimInformer.AddEventHandlerWithResyncPeriod(claimHandler, controller.resyncPeriod)
controller.claims, controller.claimController =
controller.claimInformer.GetStore(),
controller.claimInformer.GetController()
} else {
controller.claims, controller.claimController =
cache.NewInformer(
claimSource,
&v1.PersistentVolumeClaim{},
controller.resyncPeriod,
claimHandler,
)
}
2、PV
if controller.volumeInformer != nil {
controller.volumeInformer.AddEventHandlerWithResyncPeriod(volumeHandler, controller.resyncPeriod)
controller.volumes, controller.volumeController =
controller.volumeInformer.GetStore(),
controller.volumeInformer.GetController()
} else {
controller.volumes, controller.volumeController =
cache.NewInformer(
volumeSource,
&v1.PersistentVolume{},
controller.resyncPeriod,
volumeHandler,
)
}
3、StorageClass
if controller.classInformer != nil {
// no resource event handler needed for StorageClasses
controller.classes, controller.classController =
controller.classInformer.GetStore(),
controller.classInformer.GetController()
} else {
controller.classes, controller.classController = cache.NewInformer(
classSource,
versionedClassType,
controller.resyncPeriod,
classHandler,
)
}
通过cache.NewInformer方法构造,入参是ListWatch结构体和ResourceEventHandlerFuncs等,返回值是Store和Controller。
通过以上各个部分的构造,最后返回一个具体的ProvisionController对象。
3.3. ProvisionController.Run方法
ProvisionController的Run方法是以常驻进程的方式运行,函数内部再运行其他的controller。
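结合3.2节的构造函数,一个外部provisioner的main函数通常按下面的方式组装并运行ProvisionController(示意代码,省略了部分错误处理;其中myProvisioner为假设的、实现了Provisioner接口的对象):
import (
    "github.com/golang/glog"
    "github.com/kubernetes-incubator/external-storage/lib/controller"
    "k8s.io/apimachinery/pkg/util/wait"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
)

func main() {
    // 集群内运行时使用InClusterConfig,本地调试可替换为kubeconfig方式
    config, err := rest.InClusterConfig()
    if err != nil {
        glog.Fatalf("failed to create config: %v", err)
    }
    clientset, err := kubernetes.NewForConfig(config)
    if err != nil {
        glog.Fatalf("failed to create client: %v", err)
    }
    // NewProvisionController需要集群版本信息
    serverVersion, err := clientset.Discovery().ServerVersion()
    if err != nil {
        glog.Fatalf("failed to get server version: %v", err)
    }
    // myProvisioner需实现Provisioner接口的Provision/Delete方法(见下文第4节总结)
    pc := controller.NewProvisionController(
        clientset,
        "example.com/my-provisioner", // provisioner名称,需与StorageClass中的provisioner字段一致
        myProvisioner,
        serverVersion.GitVersion,
    )
    // Run以常驻进程方式运行,直到stopCh被关闭
    pc.Run(wait.NeverStop)
}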
3.3.1. prometheus数据收集
// Run starts all of this controller's control loops
func (ctrl *ProvisionController) Run(stopCh <-chan struct{}) {
run := func(stopCh <-chan struct{}) {
...
if ctrl.metricsPort > 0 {
prometheus.MustRegister([]prometheus.Collector{
metrics.PersistentVolumeClaimProvisionTotal,
metrics.PersistentVolumeClaimProvisionFailedTotal,
metrics.PersistentVolumeClaimProvisionDurationSeconds,
metrics.PersistentVolumeDeleteTotal,
metrics.PersistentVolumeDeleteFailedTotal,
metrics.PersistentVolumeDeleteDurationSeconds,
}...)
http.Handle(ctrl.metricsPath, promhttp.Handler())
address := net.JoinHostPort(ctrl.metricsAddress, strconv.FormatInt(int64(ctrl.metricsPort), 10))
glog.Infof("Starting metrics server at %s\n", address)
go wait.Forever(func() {
err := http.ListenAndServe(address, nil)
if err != nil {
glog.Errorf("Failed to listen on %s: %v", address, err)
}
}, 5*time.Second)
}
...
}
3.3.2. Controller.Run
// If a SharedInformer has been passed in, this controller should not
// call Run again
if ctrl.claimInformer == nil {
go ctrl.claimController.Run(stopCh)
}
if ctrl.volumeInformer == nil {
go ctrl.volumeController.Run(stopCh)
}
if ctrl.classInformer == nil {
go ctrl.classController.Run(stopCh)
}
启动上面构造的Controller(即Informer),开始List-Watch并同步缓存;如果外部传入的是SharedInformer,则不在此处重复启动。
3.3.3. Worker
for i := 0; i < ctrl.threadiness; i++ {
go wait.Until(ctrl.runClaimWorker, time.Second, stopCh)
go wait.Until(ctrl.runVolumeWorker, time.Second, stopCh)
}
runClaimWorker和runVolumeWorker分别为PVC和PV的worker,这两个的具体执行体分别是processNextClaimWorkItem和processNextVolumeWorkItem。
执行流程如下:
PVC的函数调用流程
runClaimWorker→processNextClaimWorkItem→syncClaimHandler→syncClaim→provisionClaimOperation
PV的函数调用流程
runVolumeWorker→processNextVolumeWorkItem→syncVolumeHandler→syncVolume→deleteVolumeOperation
可见最后执行的函数分别是provisionClaimOperation和deleteVolumeOperation。
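processNextClaimWorkItem和processNextVolumeWorkItem遵循client-go中worker消费限速队列的通用模式:取出key、调用sync函数、根据结果Forget或重新入队。下面是该模式的一个简化示意(非原库代码):
import "k8s.io/client-go/util/workqueue"

// processNextWorkItem的通用模式示意:返回false表示队列已关闭,worker应退出循环
func processNextWorkItem(queue workqueue.RateLimitingInterface, syncFunc func(key string) error) bool {
    obj, shutdown := queue.Get()
    if shutdown {
        return false
    }
    // Done必须被调用,否则该key会一直处于"处理中"状态
    defer queue.Done(obj)
    key, ok := obj.(string)
    if !ok {
        // 异常对象直接丢弃,避免无意义的重试
        queue.Forget(obj)
        return true
    }
    if err := syncFunc(key); err != nil {
        // 处理失败,按限速策略重新入队,等待下次重试
        queue.AddRateLimited(key)
        return true
    }
    // 处理成功,清除该key的限速记录
    queue.Forget(key)
    return true
}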
3.4. Operation
3.4.1. provisionClaimOperation
1、provisionClaimOperation入参是PVC,通过PVC获得PV对象,并判断PV对象是否存在,如果存在则退出后续操作。
// provisionClaimOperation attempts to provision a volume for the given claim.
// Returns error, which indicates whether provisioning should be retried
// (requeue the claim) or not
func (ctrl *ProvisionController) provisionClaimOperation(claim *v1.PersistentVolumeClaim) error {
// Most code here is identical to that found in controller.go of kube's PV controller...
claimClass := helper.GetPersistentVolumeClaimClass(claim)
operation := fmt.Sprintf("provision %q class %q", claimToClaimKey(claim), claimClass)
glog.Infof(logOperation(operation, "started"))
// A previous doProvisionClaim may just have finished while we were waiting for
// the locks. Check that PV (with deterministic name) hasn't been provisioned
// yet.
pvName := ctrl.getProvisionedVolumeNameForClaim(claim)
volume, err := ctrl.client.CoreV1().PersistentVolumes().Get(pvName, metav1.GetOptions{})
if err == nil && volume != nil {
// Volume has been already provisioned, nothing to do.
glog.Infof(logOperation(operation, "persistentvolume %q already exists, skipping", pvName))
return nil
}
...
}
2、获取StorageClass对象中的Provisioner和ReclaimPolicy参数,如果provisionerName和StorageClass对象中的provisioner字段不一致则报错并退出执行。
provisioner, parameters, err := ctrl.getStorageClassFields(claimClass)
if err != nil {
glog.Errorf(logOperation(operation, "error getting claim's StorageClass's fields: %v", err))
return nil
}
if provisioner != ctrl.provisionerName {
// class.Provisioner has either changed since shouldProvision() or
// annDynamicallyProvisioned contains different provisioner than
// class.Provisioner.
glog.Errorf(logOperation(operation, "unknown provisioner %q requested in claim's StorageClass", provisioner))
return nil
}
// Check if this provisioner can provision this claim.
if err = ctrl.canProvision(claim); err != nil {
ctrl.eventRecorder.Event(claim, v1.EventTypeWarning, "ProvisioningFailed", err.Error())
glog.Errorf(logOperation(operation, "failed to provision volume: %v", err))
return nil
}
reclaimPolicy := v1.PersistentVolumeReclaimDelete
if ctrl.kubeVersion.AtLeast(utilversion.MustParseSemantic("v1.8.0")) {
reclaimPolicy, err = ctrl.fetchReclaimPolicy(claimClass)
if err != nil {
return err
}
}
3、执行具体的provisioner.Provision方法,构建PV对象,例如本文中的provisioner是nfs-provisioner。
options := VolumeOptions{
PersistentVolumeReclaimPolicy: reclaimPolicy,
PVName: pvName,
PVC: claim,
MountOptions: mountOptions,
Parameters: parameters,
SelectedNode: selectedNode,
AllowedTopologies: allowedTopologies,
}
ctrl.eventRecorder.Event(claim, v1.EventTypeNormal, "Provisioning", fmt.Sprintf("External provisioner is provisioning volume for claim %q", claimToClaimKey(claim)))
volume, err = ctrl.provisioner.Provision(options)
if err != nil {
if ierr, ok := err.(*IgnoredError); ok {
// Provision ignored, do nothing and hope another provisioner will provision it.
glog.Infof(logOperation(operation, "volume provision ignored: %v", ierr))
return nil
}
err = fmt.Errorf("failed to provision volume with StorageClass %q: %v", claimClass, err)
ctrl.eventRecorder.Event(claim, v1.EventTypeWarning, "ProvisioningFailed", err.Error())
return err
}
4、创建k8s的PV对象。
// Try to create the PV object several times
for i := 0; i < ctrl.createProvisionedPVRetryCount; i++ {
glog.Infof(logOperation(operation, "trying to save persistentvvolume %q", volume.Name))
if _, err = ctrl.client.CoreV1().PersistentVolumes().Create(volume); err == nil || apierrs.IsAlreadyExists(err) {
// Save succeeded.
if err != nil {
glog.Infof(logOperation(operation, "persistentvolume %q already exists, reusing", volume.Name))
err = nil
} else {
glog.Infof(logOperation(operation, "persistentvolume %q saved", volume.Name))
}
break
}
// Save failed, try again after a while.
glog.Infof(logOperation(operation, "failed to save persistentvolume %q: %v", volume.Name, err))
time.Sleep(ctrl.createProvisionedPVInterval)
}
5、创建PV失败,清理存储资源。
if err != nil {
// Save failed. Now we have a storage asset outside of Kubernetes,
// but we don't have appropriate PV object for it.
// Emit some event here and try to delete the storage asset several
// times.
...
for i := 0; i < ctrl.createProvisionedPVRetryCount; i++ {
if err = ctrl.provisioner.Delete(volume); err == nil {
// Delete succeeded
glog.Infof(logOperation(operation, "cleaning volume %q succeeded", volume.Name))
break
}
// Delete failed, try again after a while.
glog.Infof(logOperation(operation, "failed to clean volume %q: %v", volume.Name, err))
time.Sleep(ctrl.createProvisionedPVInterval)
}
if err != nil {
// Delete failed several times. There is an orphaned volume and there
// is nothing we can do about it.
strerr := fmt.Sprintf("Error cleaning provisioned volume for claim %s: %v. Please delete manually.", claimToClaimKey(claim), err)
glog.Error(logOperation(operation, strerr))
ctrl.eventRecorder.Event(claim, v1.EventTypeWarning, "ProvisioningCleanupFailed", strerr)
}
}
如果创建成功,则打印成功的日志,并返回nil。
3.4.2. deleteVolumeOperation
1、deleteVolumeOperation入参是PV,先获得PV对象,并判断是否需要删除。
// deleteVolumeOperation attempts to delete the volume backing the given
// volume. Returns error, which indicates whether deletion should be retried
// (requeue the volume) or not
func (ctrl *ProvisionController) deleteVolumeOperation(volume *v1.PersistentVolume) error {
...
// This method may have been waiting for a volume lock for some time.
// Our check does not have to be as sophisticated as PV controller's, we can
// trust that the PV controller has set the PV to Released/Failed and it's
// ours to delete
newVolume, err := ctrl.client.CoreV1().PersistentVolumes().Get(volume.Name, metav1.GetOptions{})
if err != nil {
return nil
}
if !ctrl.shouldDelete(newVolume) {
glog.Infof(logOperation(operation, "persistentvolume no longer needs deletion, skipping"))
return nil
}
...
}
2、调用具体的provisioner的Delete方法,例如,如果是nfs-provisioner,则是调用nfs-provisioner的Delete方法。
err = ctrl.provisioner.Delete(volume)
if err != nil {
if ierr, ok := err.(*IgnoredError); ok {
// Delete ignored, do nothing and hope another provisioner will delete it.
glog.Infof(logOperation(operation, "volume deletion ignored: %v", ierr))
return nil
}
// Delete failed, emit an event.
glog.Errorf(logOperation(operation, "volume deletion failed: %v", err))
ctrl.eventRecorder.Event(volume, v1.EventTypeWarning, "VolumeFailedDelete", err.Error())
return err
}
3、删除k8s中的PV对象。
// Delete the volume
if err = ctrl.client.CoreV1().PersistentVolumes().Delete(volume.Name, nil); err != nil {
// Oops, could not delete the volume and therefore the controller will
// try to delete the volume again on next update.
glog.Infof(logOperation(operation, "failed to delete persistentvolume: %v", err))
return err
}
4. 总结
Provisioner接口包含Provision和Delete两个方法,自定义的provisioner需要实现这两个方法。这两个方法只处理跟存储类型相关的事项,并不涉及PV、PVC对象的增删等操作。Provision方法主要用来构造PV对象,不同类型的Provisioner,差异一般在PersistentVolumeSource的类型和参数上,例如nfs-provisioner对应的PersistentVolumeSource为NFS,并且需要传入NFS相关的参数:Server、Path等。Delete方法主要针对对应的存储类型,做数据存档(备份)或删除的处理。StorageClass对象需要单独创建,用来指定由哪个provisioner来执行相关逻辑。provisionClaimOperation和deleteVolumeOperation具体执行了k8s中PV对象的创建和删除操作,同时调用具体provisioner的Provision和Delete两个方法来处理存储数据。
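下面给出一个实现Provisioner接口的最小化示意,思路参照参考文章中的hostpath-provisioner demo,细节做了简化,仅用于说明Provision和Delete各自需要完成的工作:
import (
    "os"
    "path/filepath"

    "github.com/kubernetes-incubator/external-storage/lib/controller"
    v1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

type hostPathProvisioner struct {
    // pvDir为宿主机上存放卷数据的根目录(示例参数)
    pvDir string
}

// Provision在宿主机上创建目录,并根据PVC构造出对应的PV对象
func (p *hostPathProvisioner) Provision(options controller.VolumeOptions) (*v1.PersistentVolume, error) {
    path := filepath.Join(p.pvDir, options.PVName)
    if err := os.MkdirAll(path, 0777); err != nil {
        return nil, err
    }
    pv := &v1.PersistentVolume{
        ObjectMeta: metav1.ObjectMeta{Name: options.PVName},
        Spec: v1.PersistentVolumeSpec{
            PersistentVolumeReclaimPolicy: options.PersistentVolumeReclaimPolicy,
            AccessModes:                   options.PVC.Spec.AccessModes,
            Capacity: v1.ResourceList{
                v1.ResourceStorage: options.PVC.Spec.Resources.Requests[v1.ResourceStorage],
            },
            // 不同类型provisioner的差异主要体现在PersistentVolumeSource及其参数上
            PersistentVolumeSource: v1.PersistentVolumeSource{
                HostPath: &v1.HostPathVolumeSource{Path: path},
            },
        },
    }
    return pv, nil
}

// Delete在PV被回收时清理(或归档)对应的存储数据
func (p *hostPathProvisioner) Delete(volume *v1.PersistentVolume) error {
    return os.RemoveAll(volume.Spec.PersistentVolumeSource.HostPath.Path)
}
需要注意,StorageClass中的provisioner字段要与该provisioner注册时使用的名称一致,对应的claim才会被分发给它处理。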
参考文章
- https://github.com/kubernetes-incubator/external-storage/tree/master/docs/demo/hostpath-provisioner
- https://github.com/kubernetes-incubator/external-storage/tree/master/nfs-client
- https://github.com/kubernetes-incubator/external-storage/blob/master/lib/controller/controller.go
- https://github.com/kubernetes-incubator/external-storage/blob/master/lib/controller/volume.go
10 - 问题排查
10.1 - 节点问题
10.1.1 - keycreate permission denied
问题描述
write /proc/self/attr/keycreate: permission denied
具体报错:
kuberuntime_manager.go:758] createPodSandbox for pod "ecc-hostpath-provisioner-8jbhf_kube-system(b8050fd3-4ffe-11eb-a82e-c6090b53405b)" failed: rpc error: code = Unknown desc = failed to start sandbox container for pod "ecc-hostpath-provisioner-8jbhf": Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"write /proc/self/attr/keycreate: permission denied\"": unknown
解决办法
原因:SELinux未设置成disabled。
# 将SELINUX设置成disabled
setenforce 0 # 临时生效
# 永久生效,但需重启,配合上述命令可以不用立即重启
sed -i "s/SELINUX=enforcing/SELINUX=disabled/g" /etc/selinux/config
# 查看SELinux状态
$ /usr/sbin/sestatus -v
SELinux status: disabled
$ getenforce
Disabled
10.1.2 - Cgroup不支持pid资源
问题描述
机器内核版本较低,kubelet启动异常,报错如下:
Failed to start ContainerManager failed to initialize top level QOS containers: failed to update top level Burstable QOS cgroup : failed to set supported cgroup subsystems for cgroup [kubepods burstable]: Failed to find subsystem mount for required subsystem: pids
原因分析
低版本内核的cgroup不支持pids资源,可通过/proc/cgroups确认:
cat /proc/cgroups
#subsys_name hierarchy num_cgroups enabled
cpuset 5 6 1
cpu 2 76 1
cpuacct 2 76 1
memory 4 76 1
devices 10 76 1
freezer 7 6 1
net_cls 3 6 1
blkio 8 76 1
perf_event 9 6 1
hugetlb 6 6 1
正常机器的cgroup
root@host:~# cat /proc/cgroups
#subsys_name hierarchy num_cgroups enabled
cpuset 5 17 1
cpu 7 80 1
cpuacct 7 80 1
memory 12 80 1
devices 10 80 1
freezer 2 17 1
net_cls 4 17 1
blkio 8 80 1
perf_event 6 17 1
hugetlb 11 17 1
pids 3 80 1 # 此处支持pids资源
oom 9 1 1
解决方案
1、升级内核版本,使得cgroup支持pids资源。
或者
2、将kubelet的启动参数添加 SupportPodPidsLimit=false,SupportNodePidsLimit=false
vi /etc/systemd/system/kubelet.service
# 添加 kubelet 启动参数
--feature-gates=... ,SupportPodPidsLimit=false,SupportNodePidsLimit=false \
systemctl daemon-reload && systemctl restart kubelet.service
文档参考:
10.1.3 - Cgroup子系统无法挂载
问题描述
内核版本: 5.4.56-200.el7.x86_64
docker报错
May 13 16:54:26 8b26d7a8 dockerd[44352]: time="2021-05-13T16:54:26.565235530+08:00" level=warning msg="failed to load plugin io.containerd.snapshotter.v1.devmapper" error="devmapper not configured"
May 13 16:54:26 8b26d7a8 dockerd[44352]: time="2021-05-13T16:54:26.565525512+08:00" level=warning msg="could not use snapshotter devmapper in metadata plugin" error="devmapper not configured"
May 13 16:54:26 8b26d7a8 dockerd[44352]: time="2021-05-13T16:54:26.574734345+08:00" level=warning msg="Your kernel does not support CPU realtime scheduler"
May 13 16:54:26 8b26d7a8 dockerd[44352]: time="2021-05-13T16:54:26.574792864+08:00" level=warning msg="Your kernel does not support cgroup blkio weight"
May 13 16:54:26 8b26d7a8 dockerd[44352]: time="2021-05-13T16:54:26.574800326+08:00" level=warning msg="Your kernel does not support cgroup blkio weight_device"
kubelet报错

解决
cgroup问题解决:
1、curl https://pi-ops.oss-cn-hangzhou.aliyuncs.com/scripts/cgroupfs-mount.sh | bash
2、重启设备即可解决
10.2 - Pod驱逐
问题描述
节点Pod被驱逐
原因
1. 查看节点和该节点pod状态
查看节点状态为Ready,查看该节点的所有pod,发现存在被驱逐的pod和nvidia-device-plugin为pending
root@host:~$ kgpoallowide |grep 192.168.1.1
department-56 173e397c-ea35-4aac-85d8-07106e55d7b7 0/1 Evicted 0 52d <none> 192.168.1.1 <none>
kube-system nvidia-device-plugin-daemonset-d58d2 0/1 Pending 0 1s <none> 192.168.1.1 <none>
2. 查看对应节点kubelet的日志
W0905 15:42:13.182280 23506 eviction_manager.go:142] Failed to admit pod rdma-device-plugin-daemonset-8nwb8_kube-system(acc28a85-cfb0-11e9-9729-6c92bf5e2432) - node has conditions: [DiskPressure]
I0905 15:42:14.827343 23506 kubelet.go:1836] SyncLoop (ADD, "api"): "nvidia-device-plugin-daemonset-88sm6_kube-system(adbd9227-cfb0-11e9-9729-6c92bf5e2432)"
W0905 15:42:14.827372 23506 eviction_manager.go:142] Failed to admit pod nvidia-device-plugin-daemonset-88sm6_kube-system(adbd9227-cfb0-11e9-9729-6c92bf5e2432) - node has conditions: [DiskPressure]
I0905 15:42:15.722378 23506 kubelet_node_status.go:607] Update capacity for nvidia.com/gpu-share to 0
I0905 15:42:16.692488 23506 kubelet.go:1852] SyncLoop (DELETE, "api"): "rdma-device-plugin-daemonset-8nwb8_kube-system(acc28a85-cfb0-11e9-9729-6c92bf5e2432)"
W0905 15:42:16.698445 23506 status_manager.go:489] Failed to delete status for pod "rdma-device-plugin-daemonset-8nwb8_kube-system(acc28a85-cfb0-11e9-9729-6c92bf5e2432)": pod "rdma-device-plugin-daemonset-8nwb8" not found
I0905 15:42:16.698490 23506 kubelet.go:1846] SyncLoop (REMOVE, "api"): "rdma-device-plugin-daemonset-8nwb8_kube-system(acc28a85-cfb0-11e9-9729-6c92bf5e2432)"
I0905 15:42:16.699267 23506 kubelet.go:2040] Failed to delete pod "rdma-device-plugin-daemonset-8nwb8_kube-system(acc28a85-cfb0-11e9-9729-6c92bf5e2432)", err: pod not found
W0905 15:42:16.777355 23506 eviction_manager.go:332] eviction manager: attempting to reclaim nodefs
I0905 15:42:16.777384 23506 eviction_manager.go:346] eviction manager: must evict pod(s) to reclaim nodefs
E0905 15:42:16.777390 23506 eviction_manager.go:357] eviction manager: eviction thresholds have been met, but no pods are active to evict
存在关于pod驱逐相关的日志,驱逐的原因为node has conditions: [DiskPressure]。
3. 查看磁盘相关信息
[root@host /]# df -h
Filesystem Size Used Avail Use% Mounted on
devtmpfs 126G 0 126G 0% /dev
tmpfs 126G 0 126G 0% /dev/shm
tmpfs 126G 27M 126G 1% /run
tmpfs 126G 0 126G 0% /sys/fs/cgroup
/dev/sda1 20G 19G 0 100% / # 根目录磁盘满
/dev/nvme1n1 3.0T 191G 2.8T 7% /data2
/dev/nvme0n1 3.0T 1.3T 1.7T 44% /data1
/dev/sda4 182G 95G 87G 53% /data
/dev/sda3 20G 3.8G 15G 20% /usr/local
tmpfs 26G 0 26G 0% /run/user/0
发现根目录的磁盘已满,接着查看哪些文件占用磁盘。
[root@host ~/kata]# du -sh ./*
1.0M ./log
944K ./netlink
6.6G ./kernel3
/var/log/下存在7G 的日志。清理相关日志和无用文件后,根目录恢复空间。
[root@host /data]# df -h
Filesystem Size Used Avail Use% Mounted on
devtmpfs 126G 0 126G 0% /dev
tmpfs 126G 0 126G 0% /dev/shm
tmpfs 126G 27M 126G 1% /run
tmpfs 126G 0 126G 0% /sys/fs/cgroup
/dev/sda1 20G 5.8G 13G 32% / # 根目录正常
/dev/nvme1n1 3.0T 191G 2.8T 7% /data2
查看节点pod状态,相关plugin的pod恢复正常。
root@host:~$ kgpoallowide |grep 192.168.1.1
kube-system nvidia-device-plugin-daemonset-h4pjc 1/1 Running 0 16m 192.168.1.1 192.168.1.1 <none>
kube-system rdma-device-plugin-daemonset-xlkbv 1/1 Running 0 16m 192.168.1.1 192.168.1.1 <none>
4. 查看kubelet配置
查看kubelet关于pod驱逐相关的参数配置,可见节点kubelet开启了驱逐机制,正常情况下该配置应该是关闭的。
ExecStart=/usr/local/bin/kubelet \
...
--eviction-hard=nodefs.available<1% \
解决方案
总结以上原因:kubelet开启了pod驱逐机制,且根目录磁盘使用率达到100%,导致该节点上的Pod被驱逐,新的Pod也无法正常创建。
解决方案如下:
1、关闭kubelet的驱逐机制。
2、清除根目录的文件,恢复根目录空间,并后续增加根目录的磁盘监控。
10.3 - 镜像拉取失败问题
常见镜像拉取问题排查
1. Pod状态为ErrImagePull或ImagePullBackOff
docker-hub-75d4dfb984-5hggg 0/1 ImagePullBackOff 0 14m 192.168.1.30 <node ip>
docker-hub-75d4dfb984-9r57b 0/1 ErrImagePull 0 53s 192.168.0.42 <node ip>
- ErrImagePull:表示pod已经调度到node节点,kubelet调用docker去拉取镜像失败。
- ImagePullBackOff:表示kubelet拉取镜像失败后,不断重试去拉取仍然失败。
2. 查看pod的事件
通过kubectl describe pod命令查看pod事件,该事件的报错信息在kubelet或docker的日志中也能看到。
2.1. http: server gave HTTP response to HTTPS client
如果遇到以下报错,尝试将该镜像仓库添加到docker可信任的镜像仓库配置中。
Error getting v2 registry: Get https://docker.com:8080/v2/: http: server gave HTTP response to HTTPS client"
具体操作是修改/etc/docker/daemon.json的insecure-registries参数
#cat /etc/docker/daemon.json
{
...
"insecure-registries": [
...
"docker.com:8080"
],
...
}
2.2. no basic auth credentials
如果遇到no basic auth credentials报错,说明kubelet调用docker接口去拉取镜像时,镜像仓库的认证失败(缺少有效的认证凭据)。
Normal BackOff 18s kubelet, 192.168.1.1 Back-off pulling image "docker.com:8080/public/2048:latest"
Warning Failed 18s kubelet, 192.168.1.1 Error: ImagePullBackOff
Normal Pulling 5s (x2 over 18s) kubelet, 192.168.1.1 Pulling image "docker.com:8080/public/2048:latest"
Warning Failed 5s (x2 over 18s) kubelet, 192.168.1.1 Failed to pull image "docker.com:8080/public/2048:latest": rpc error: code = Unknown desc = Error response from daemon: Get http://docker.com:8080/v2/public/2048/manifests/latest: no basic auth credentials
Warning Failed 5s (x2 over 18s) kubelet, 192.168.1.1 Error: ErrImagePull
具体操作,在拉取镜像失败的节点上登录该镜像仓库,认证信息会更新到 $HOME/.docker/config.json文件中。将该文件拷贝到/var/lib/kubelet/config.json中。
10.4 - PVC Terminating
问题描述
pvc terminating
pvc在删除时,卡在terminating中。
解决方法
kubectl patch pvc {PVC_NAME} -p '{"metadata":{"finalizers":null}}'
11 - 源码分析
11.1 -
详见:Kubernetes源码分析笔记
11.2 - kube-apiserver
11.2.1 - kube-apiserver源码分析(一)之 NewAPIServerCommand
以下代码分析基于kubernetes v1.12.0版本。
本文主要分析kube-apiserver中cmd部分的代码,即NewAPIServerCommand相关的代码。更多具体的逻辑待后续文章分析。
kube-apiserver的cmd部分目录代码结构如下:
kube-apiserver
├── apiserver.go # kube-apiserver的main入口
└── app
├── aggregator.go
├── apiextensions.go
├── options # 初始化kube-apiserver使用到的option
│ ├── options.go # 包括:NewServerRunOptions、Flags等
│ ├── options_test.go
│ └── validation.go
├── server.go # 包括:NewAPIServerCommand、Run、CreateServerChain、Complete等
1. Main
此部分代码位于cmd/kube-apiserver/apiserver.go
func main() {
rand.Seed(time.Now().UTC().UnixNano())
command := app.NewAPIServerCommand(server.SetupSignalHandler())
// TODO: once we switch everything over to Cobra commands, we can go back to calling
// utilflag.InitFlags() (by removing its pflag.Parse() call). For now, we have to set the
// normalize func and add the go flag set by hand.
pflag.CommandLine.SetNormalizeFunc(utilflag.WordSepNormalizeFunc)
pflag.CommandLine.AddGoFlagSet(goflag.CommandLine)
// utilflag.InitFlags()
logs.InitLogs()
defer logs.FlushLogs()
if err := command.Execute(); err != nil {
fmt.Fprintf(os.Stderr, "error: %v\n", err)
os.Exit(1)
}
}
核心代码:
// 初始化APIServerCommand
command := app.NewAPIServerCommand(server.SetupSignalHandler())
// 执行Execute
err := command.Execute()
2. NewAPIServerCommand
此部分的代码位于/cmd/kube-apiserver/app/server.go
NewAPIServerCommand基于Cobra命令行框架构造*cobra.Command对象,主要包括三部分:
- 构造option
- 添加Flags
- 执行Run函数
完整代码如下:
此部分代码位于cmd/kube-apiserver/app/server.go
// NewAPIServerCommand creates a *cobra.Command object with default parameters
func NewAPIServerCommand(stopCh <-chan struct{}) *cobra.Command {
s := options.NewServerRunOptions()
cmd := &cobra.Command{
Use: "kube-apiserver",
Long: `The Kubernetes API server validates and configures data
for the api objects which include pods, services, replicationcontrollers, and
others. The API Server services REST operations and provides the frontend to the
cluster's shared state through which all other components interact.`,
RunE: func(cmd *cobra.Command, args []string) error {
verflag.PrintAndExitIfRequested()
utilflag.PrintFlags(cmd.Flags())
// set default options
completedOptions, err := Complete(s)
if err != nil {
return err
}
// validate options
if errs := completedOptions.Validate(); len(errs) != 0 {
return utilerrors.NewAggregate(errs)
}
return Run(completedOptions, stopCh)
},
}
fs := cmd.Flags()
namedFlagSets := s.Flags()
for _, f := range namedFlagSets.FlagSets {
fs.AddFlagSet(f)
}
usageFmt := "Usage:\n %s\n"
cols, _, _ := apiserverflag.TerminalSize(cmd.OutOrStdout())
cmd.SetUsageFunc(func(cmd *cobra.Command) error {
fmt.Fprintf(cmd.OutOrStderr(), usageFmt, cmd.UseLine())
apiserverflag.PrintSections(cmd.OutOrStderr(), namedFlagSets, cols)
return nil
})
cmd.SetHelpFunc(func(cmd *cobra.Command, args []string) {
fmt.Fprintf(cmd.OutOrStdout(), "%s\n\n"+usageFmt, cmd.Long, cmd.UseLine())
apiserverflag.PrintSections(cmd.OutOrStdout(), namedFlagSets, cols)
})
return cmd
}
核心代码:
// 构造option
s := options.NewServerRunOptions()
// 添加flags
fs := cmd.Flags()
namedFlagSets := s.Flags()
for _, f := range namedFlagSets.FlagSets {
fs.AddFlagSet(f)
}
// set default options
completedOptions, err := Complete(s)
// Run
Run(completedOptions, stopCh)
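option→Flags→Run是Cobra加pflag项目的通用组织方式。下面用一个与kube-apiserver无关的最小示例演示这一模式(示意代码):
import (
    "fmt"
    "os"

    "github.com/spf13/cobra"
)

type serverOptions struct {
    listenAddr string
}

func newServerCommand() *cobra.Command {
    // 1. 构造option,并给出默认值
    o := &serverOptions{listenAddr: ":8080"}
    cmd := &cobra.Command{
        Use: "demo-server",
        RunE: func(cmd *cobra.Command, args []string) error {
            // 3. 此处可类比Complete/Validate,校验通过后进入Run逻辑
            fmt.Printf("listening on %s\n", o.listenAddr)
            return nil
        },
    }
    // 2. 添加Flags:flag解析后会自动写入option结构体对应字段
    cmd.Flags().StringVar(&o.listenAddr, "listen-addr", o.listenAddr, "server listen address")
    return cmd
}

func main() {
    if err := newServerCommand().Execute(); err != nil {
        fmt.Fprintf(os.Stderr, "error: %v\n", err)
        os.Exit(1)
    }
}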
3. NewServerRunOptions
NewServerRunOptions基于默认的参数构造ServerRunOptions结构体。ServerRunOptions是apiserver运行的配置信息。具体结构体定义如下。
3.1. ServerRunOptions
其中主要的配置如下:
- GenericServerRunOptions
- Etcd
- SecureServing
- KubeletConfig
- ...
// ServerRunOptions runs a kubernetes api server.
type ServerRunOptions struct {
GenericServerRunOptions *genericoptions.ServerRunOptions
Etcd *genericoptions.EtcdOptions
SecureServing *genericoptions.SecureServingOptionsWithLoopback
InsecureServing *genericoptions.DeprecatedInsecureServingOptionsWithLoopback
Audit *genericoptions.AuditOptions
Features *genericoptions.FeatureOptions
Admission *kubeoptions.AdmissionOptions
Authentication *kubeoptions.BuiltInAuthenticationOptions
Authorization *kubeoptions.BuiltInAuthorizationOptions
CloudProvider *kubeoptions.CloudProviderOptions
StorageSerialization *kubeoptions.StorageSerializationOptions
APIEnablement *genericoptions.APIEnablementOptions
AllowPrivileged bool
EnableLogsHandler bool
EventTTL time.Duration
KubeletConfig kubeletclient.KubeletClientConfig
KubernetesServiceNodePort int
MaxConnectionBytesPerSec int64
ServiceClusterIPRange net.IPNet // TODO: make this a list
ServiceNodePortRange utilnet.PortRange
SSHKeyfile string
SSHUser string
ProxyClientCertFile string
ProxyClientKeyFile string
EnableAggregatorRouting bool
MasterCount int
EndpointReconcilerType string
ServiceAccountSigningKeyFile string
}
3.2. NewServerRunOptions
NewServerRunOptions初始化配置结构体。
// NewServerRunOptions creates a new ServerRunOptions object with default parameters
func NewServerRunOptions() *ServerRunOptions {
s := ServerRunOptions{
GenericServerRunOptions: genericoptions.NewServerRunOptions(),
Etcd: genericoptions.NewEtcdOptions(storagebackend.NewDefaultConfig(kubeoptions.DefaultEtcdPathPrefix, nil)),
SecureServing: kubeoptions.NewSecureServingOptions(),
InsecureServing: kubeoptions.NewInsecureServingOptions(),
Audit: genericoptions.NewAuditOptions(),
Features: genericoptions.NewFeatureOptions(),
Admission: kubeoptions.NewAdmissionOptions(),
Authentication: kubeoptions.NewBuiltInAuthenticationOptions().WithAll(),
Authorization: kubeoptions.NewBuiltInAuthorizationOptions(),
CloudProvider: kubeoptions.NewCloudProviderOptions(),
StorageSerialization: kubeoptions.NewStorageSerializationOptions(),
APIEnablement: genericoptions.NewAPIEnablementOptions(),
EnableLogsHandler: true,
EventTTL: 1 * time.Hour,
MasterCount: 1,
EndpointReconcilerType: string(reconcilers.LeaseEndpointReconcilerType),
KubeletConfig: kubeletclient.KubeletClientConfig{
Port: ports.KubeletPort,
ReadOnlyPort: ports.KubeletReadOnlyPort,
PreferredAddressTypes: []string{
// --override-hostname
string(api.NodeHostName),
// internal, preferring DNS if reported
string(api.NodeInternalDNS),
string(api.NodeInternalIP),
// external, preferring DNS if reported
string(api.NodeExternalDNS),
string(api.NodeExternalIP),
},
EnableHttps: true,
HTTPTimeout: time.Duration(5) * time.Second,
},
ServiceNodePortRange: kubeoptions.DefaultServiceNodePortRange,
}
s.ServiceClusterIPRange = kubeoptions.DefaultServiceIPCIDR
// Overwrite the default for storage data format.
s.Etcd.DefaultStorageMediaType = "application/vnd.kubernetes.protobuf"
return &s
}
3.3. Complete
当kube-apiserver的flags被解析后,调用Complete完成默认配置。
此部分代码位于cmd/kube-apiserver/app/server.go
// Should be called after kube-apiserver flags parsed.
func Complete(s *options.ServerRunOptions) (completedServerRunOptions, error) {
var options completedServerRunOptions
// set defaults
if err := s.GenericServerRunOptions.DefaultAdvertiseAddress(s.SecureServing.SecureServingOptions); err != nil {
return options, err
}
if err := kubeoptions.DefaultAdvertiseAddress(s.GenericServerRunOptions, s.InsecureServing.DeprecatedInsecureServingOptions); err != nil {
return options, err
}
serviceIPRange, apiServerServiceIP, err := master.DefaultServiceIPRange(s.ServiceClusterIPRange)
if err != nil {
return options, fmt.Errorf("error determining service IP ranges: %v", err)
}
s.ServiceClusterIPRange = serviceIPRange
if err := s.SecureServing.MaybeDefaultWithSelfSignedCerts(s.GenericServerRunOptions.AdvertiseAddress.String(), []string{"kubernetes.default.svc", "kubernetes.default", "kubernetes"}, []net.IP{apiServerServiceIP}); err != nil {
return options, fmt.Errorf("error creating self-signed certificates: %v", err)
}
if len(s.GenericServerRunOptions.ExternalHost) == 0 {
if len(s.GenericServerRunOptions.AdvertiseAddress) > 0 {
s.GenericServerRunOptions.ExternalHost = s.GenericServerRunOptions.AdvertiseAddress.String()
} else {
if hostname, err := os.Hostname(); err == nil {
s.GenericServerRunOptions.ExternalHost = hostname
} else {
return options, fmt.Errorf("error finding host name: %v", err)
}
}
glog.Infof("external host was not specified, using %v", s.GenericServerRunOptions.ExternalHost)
}
s.Authentication.ApplyAuthorization(s.Authorization)
// Use (ServiceAccountSigningKeyFile != "") as a proxy to the user enabling
// TokenRequest functionality. This defaulting was convenient, but messed up
// a lot of people when they rotated their serving cert with no idea it was
// connected to their service account keys. We are taking this oppurtunity to
// remove this problematic defaulting.
if s.ServiceAccountSigningKeyFile == "" {
// Default to the private server key for service account token signing
if len(s.Authentication.ServiceAccounts.KeyFiles) == 0 && s.SecureServing.ServerCert.CertKey.KeyFile != "" {
if kubeauthenticator.IsValidServiceAccountKeyFile(s.SecureServing.ServerCert.CertKey.KeyFile) {
s.Authentication.ServiceAccounts.KeyFiles = []string{s.SecureServing.ServerCert.CertKey.KeyFile}
} else {
glog.Warning("No TLS key provided, service account token authentication disabled")
}
}
}
if s.Etcd.StorageConfig.DeserializationCacheSize == 0 {
// When size of cache is not explicitly set, estimate its size based on
// target memory usage.
glog.V(2).Infof("Initializing deserialization cache size based on %dMB limit", s.GenericServerRunOptions.TargetRAMMB)
// This is the heuristics that from memory capacity is trying to infer
// the maximum number of nodes in the cluster and set cache sizes based
// on that value.
// From our documentation, we officially recommend 120GB machines for
// 2000 nodes, and we scale from that point. Thus we assume ~60MB of
// capacity per node.
// TODO: We may consider deciding that some percentage of memory will
// be used for the deserialization cache and divide it by the max object
// size to compute its size. We may even go further and measure
// collective sizes of the objects in the cache.
clusterSize := s.GenericServerRunOptions.TargetRAMMB / 60
s.Etcd.StorageConfig.DeserializationCacheSize = 25 * clusterSize
if s.Etcd.StorageConfig.DeserializationCacheSize < 1000 {
s.Etcd.StorageConfig.DeserializationCacheSize = 1000
}
}
if s.Etcd.EnableWatchCache {
glog.V(2).Infof("Initializing cache sizes based on %dMB limit", s.GenericServerRunOptions.TargetRAMMB)
sizes := cachesize.NewHeuristicWatchCacheSizes(s.GenericServerRunOptions.TargetRAMMB)
if userSpecified, err := serveroptions.ParseWatchCacheSizes(s.Etcd.WatchCacheSizes); err == nil {
for resource, size := range userSpecified {
sizes[resource] = size
}
}
s.Etcd.WatchCacheSizes, err = serveroptions.WriteWatchCacheSizes(sizes)
if err != nil {
return options, err
}
}
// TODO: remove when we stop supporting the legacy group version.
if s.APIEnablement.RuntimeConfig != nil {
for key, value := range s.APIEnablement.RuntimeConfig {
if key == "v1" || strings.HasPrefix(key, "v1/") ||
key == "api/v1" || strings.HasPrefix(key, "api/v1/") {
delete(s.APIEnablement.RuntimeConfig, key)
s.APIEnablement.RuntimeConfig["/v1"] = value
}
if key == "api/legacy" {
delete(s.APIEnablement.RuntimeConfig, key)
}
}
}
options.ServerRunOptions = s
return options, nil
}
4. AddFlagSet
AddFlagSet的主要作用是注册命令行flag:flag解析后,外部传入的具体值会写入option结构体对应的字段,最终供apiserver使用。
其中NewAPIServerCommand关于AddFlagSet的相关代码如下:
fs := cmd.Flags()
namedFlagSets := s.Flags()
for _, f := range namedFlagSets.FlagSets {
fs.AddFlagSet(f)
}
4.1. Flags
Flags完整代码如下:
此部分代码位于cmd/kube-apiserver/app/options/options.go
// Flags returns flags for a specific APIServer by section name
func (s *ServerRunOptions) Flags() (fss apiserverflag.NamedFlagSets) {
// Add the generic flags.
s.GenericServerRunOptions.AddUniversalFlags(fss.FlagSet("generic"))
s.Etcd.AddFlags(fss.FlagSet("etcd"))
s.SecureServing.AddFlags(fss.FlagSet("secure serving"))
s.InsecureServing.AddFlags(fss.FlagSet("insecure serving"))
s.InsecureServing.AddUnqualifiedFlags(fss.FlagSet("insecure serving")) // TODO: remove it until kops stops using `--address`
s.Audit.AddFlags(fss.FlagSet("auditing"))
s.Features.AddFlags(fss.FlagSet("features"))
s.Authentication.AddFlags(fss.FlagSet("authentication"))
s.Authorization.AddFlags(fss.FlagSet("authorization"))
s.CloudProvider.AddFlags(fss.FlagSet("cloud provider"))
s.StorageSerialization.AddFlags(fss.FlagSet("storage"))
s.APIEnablement.AddFlags(fss.FlagSet("api enablement"))
s.Admission.AddFlags(fss.FlagSet("admission"))
// Note: the weird ""+ in below lines seems to be the only way to get gofmt to
// arrange these text blocks sensibly. Grrr.
fs := fss.FlagSet("misc")
fs.DurationVar(&s.EventTTL, "event-ttl", s.EventTTL,
"Amount of time to retain events.")
fs.BoolVar(&s.AllowPrivileged, "allow-privileged", s.AllowPrivileged,
"If true, allow privileged containers. [default=false]")
fs.BoolVar(&s.EnableLogsHandler, "enable-logs-handler", s.EnableLogsHandler,
"If true, install a /logs handler for the apiserver logs.")
// Deprecated in release 1.9
fs.StringVar(&s.SSHUser, "ssh-user", s.SSHUser,
"If non-empty, use secure SSH proxy to the nodes, using this user name")
fs.MarkDeprecated("ssh-user", "This flag will be removed in a future version.")
// Deprecated in release 1.9
fs.StringVar(&s.SSHKeyfile, "ssh-keyfile", s.SSHKeyfile,
"If non-empty, use secure SSH proxy to the nodes, using this user keyfile")
fs.MarkDeprecated("ssh-keyfile", "This flag will be removed in a future version.")
fs.Int64Var(&s.MaxConnectionBytesPerSec, "max-connection-bytes-per-sec", s.MaxConnectionBytesPerSec, ""+
"If non-zero, throttle each user connection to this number of bytes/sec. "+
"Currently only applies to long-running requests.")
fs.IntVar(&s.MasterCount, "apiserver-count", s.MasterCount,
"The number of apiservers running in the cluster, must be a positive number. (In use when --endpoint-reconciler-type=master-count is enabled.)")
fs.StringVar(&s.EndpointReconcilerType, "endpoint-reconciler-type", string(s.EndpointReconcilerType),
"Use an endpoint reconciler ("+strings.Join(reconcilers.AllTypes.Names(), ", ")+")")
// See #14282 for details on how to test/try this option out.
// TODO: remove this comment once this option is tested in CI.
fs.IntVar(&s.KubernetesServiceNodePort, "kubernetes-service-node-port", s.KubernetesServiceNodePort, ""+
"If non-zero, the Kubernetes master service (which apiserver creates/maintains) will be "+
"of type NodePort, using this as the value of the port. If zero, the Kubernetes master "+
"service will be of type ClusterIP.")
fs.IPNetVar(&s.ServiceClusterIPRange, "service-cluster-ip-range", s.ServiceClusterIPRange, ""+
"A CIDR notation IP range from which to assign service cluster IPs. This must not "+
"overlap with any IP ranges assigned to nodes for pods.")
fs.Var(&s.ServiceNodePortRange, "service-node-port-range", ""+
"A port range to reserve for services with NodePort visibility. "+
"Example: '30000-32767'. Inclusive at both ends of the range.")
// Kubelet related flags:
fs.BoolVar(&s.KubeletConfig.EnableHttps, "kubelet-https", s.KubeletConfig.EnableHttps,
"Use https for kubelet connections.")
fs.StringSliceVar(&s.KubeletConfig.PreferredAddressTypes, "kubelet-preferred-address-types", s.KubeletConfig.PreferredAddressTypes,
"List of the preferred NodeAddressTypes to use for kubelet connections.")
fs.UintVar(&s.KubeletConfig.Port, "kubelet-port", s.KubeletConfig.Port,
"DEPRECATED: kubelet port.")
fs.MarkDeprecated("kubelet-port", "kubelet-port is deprecated and will be removed.")
fs.UintVar(&s.KubeletConfig.ReadOnlyPort, "kubelet-read-only-port", s.KubeletConfig.ReadOnlyPort,
"DEPRECATED: kubelet port.")
fs.DurationVar(&s.KubeletConfig.HTTPTimeout, "kubelet-timeout", s.KubeletConfig.HTTPTimeout,
"Timeout for kubelet operations.")
fs.StringVar(&s.KubeletConfig.CertFile, "kubelet-client-certificate", s.KubeletConfig.CertFile,
"Path to a client cert file for TLS.")
fs.StringVar(&s.KubeletConfig.KeyFile, "kubelet-client-key", s.KubeletConfig.KeyFile,
"Path to a client key file for TLS.")
fs.StringVar(&s.KubeletConfig.CAFile, "kubelet-certificate-authority", s.KubeletConfig.CAFile,
"Path to a cert file for the certificate authority.")
// TODO: delete this flag in 1.13
repair := false
fs.BoolVar(&repair, "repair-malformed-updates", false, "deprecated")
fs.MarkDeprecated("repair-malformed-updates", "This flag will be removed in a future version")
fs.StringVar(&s.ProxyClientCertFile, "proxy-client-cert-file", s.ProxyClientCertFile, ""+
"Client certificate used to prove the identity of the aggregator or kube-apiserver "+
"when it must call out during a request. This includes proxying requests to a user "+
"api-server and calling out to webhook admission plugins. It is expected that this "+
"cert includes a signature from the CA in the --requestheader-client-ca-file flag. "+
"That CA is published in the 'extension-apiserver-authentication' configmap in "+
"the kube-system namespace. Components receiving calls from kube-aggregator should "+
"use that CA to perform their half of the mutual TLS verification.")
fs.StringVar(&s.ProxyClientKeyFile, "proxy-client-key-file", s.ProxyClientKeyFile, ""+
"Private key for the client certificate used to prove the identity of the aggregator or kube-apiserver "+
"when it must call out during a request. This includes proxying requests to a user "+
"api-server and calling out to webhook admission plugins.")
fs.BoolVar(&s.EnableAggregatorRouting, "enable-aggregator-routing", s.EnableAggregatorRouting,
"Turns on aggregator routing requests to endpoints IP rather than cluster IP.")
fs.StringVar(&s.ServiceAccountSigningKeyFile, "service-account-signing-key-file", s.ServiceAccountSigningKeyFile, ""+
"Path to the file that contains the current private key of the service account token issuer. The issuer will sign issued ID tokens with this private key. (Requires the 'TokenRequest' feature gate.)")
return fss
}
5. Run
Run以常驻的方式运行apiserver。
主要内容如下:
- 构造一个聚合的server结构体。
- 执行PrepareRun。
- 最终执行Run。
此部分代码位于cmd/kube-apiserver/app/server.go
// Run runs the specified APIServer. This should never exit.
func Run(completeOptions completedServerRunOptions, stopCh <-chan struct{}) error {
// To help debugging, immediately log version
glog.Infof("Version: %+v", version.Get())
server, err := CreateServerChain(completeOptions, stopCh)
if err != nil {
return err
}
return server.PrepareRun().Run(stopCh)
}
5.1. CreateServerChain
构造聚合的Server。
基本流程如下:
- 首先生成config对象,包括kubeAPIServerConfig、apiExtensionsConfig。
- 再通过config生成server对象,包括apiExtensionsServer、kubeAPIServer。
- 执行apiExtensionsServer、kubeAPIServer的PrepareRun部分。
- 生成聚合的config对象aggregatorConfig。
- 基于aggregatorConfig、kubeAPIServer、apiExtensionsServer生成聚合的server:aggregatorServer。
此部分代码位于cmd/kube-apiserver/app/server.go
// CreateServerChain creates the apiservers connected via delegation.
func CreateServerChain(completedOptions completedServerRunOptions, stopCh <-chan struct{}) (*genericapiserver.GenericAPIServer, error) {
nodeTunneler, proxyTransport, err := CreateNodeDialer(completedOptions)
if err != nil {
return nil, err
}
kubeAPIServerConfig, insecureServingInfo, serviceResolver, pluginInitializer, admissionPostStartHook, err := CreateKubeAPIServerConfig(completedOptions, nodeTunneler, proxyTransport)
if err != nil {
return nil, err
}
// If additional API servers are added, they should be gated.
apiExtensionsConfig, err := createAPIExtensionsConfig(*kubeAPIServerConfig.GenericConfig, kubeAPIServerConfig.ExtraConfig.VersionedInformers, pluginInitializer, completedOptions.ServerRunOptions, completedOptions.MasterCount)
if err != nil {
return nil, err
}
apiExtensionsServer, err := createAPIExtensionsServer(apiExtensionsConfig, genericapiserver.NewEmptyDelegate())
if err != nil {
return nil, err
}
kubeAPIServer, err := CreateKubeAPIServer(kubeAPIServerConfig, apiExtensionsServer.GenericAPIServer, admissionPostStartHook)
if err != nil {
return nil, err
}
// otherwise go down the normal path of standing the aggregator up in front of the API server
// this wires up openapi
kubeAPIServer.GenericAPIServer.PrepareRun()
// This will wire up openapi for extension api server
apiExtensionsServer.GenericAPIServer.PrepareRun()
// aggregator comes last in the chain
aggregatorConfig, err := createAggregatorConfig(*kubeAPIServerConfig.GenericConfig, completedOptions.ServerRunOptions, kubeAPIServerConfig.ExtraConfig.VersionedInformers, serviceResolver, proxyTransport, pluginInitializer)
if err != nil {
return nil, err
}
aggregatorServer, err := createAggregatorServer(aggregatorConfig, kubeAPIServer.GenericAPIServer, apiExtensionsServer.Informers)
if err != nil {
// we don't need special handling for innerStopCh because the aggregator server doesn't create any go routines
return nil, err
}
if insecureServingInfo != nil {
insecureHandlerChain := kubeserver.BuildInsecureHandlerChain(aggregatorServer.GenericAPIServer.UnprotectedHandler(), kubeAPIServerConfig.GenericConfig)
if err := insecureServingInfo.Serve(insecureHandlerChain, kubeAPIServerConfig.GenericConfig.RequestTimeout, stopCh); err != nil {
return nil, err
}
}
return aggregatorServer.GenericAPIServer, nil
}
5.2. PrepareRun
PrepareRun主要执行API安装后的准备工作,例如注册Swagger/OpenAPI路由、安装healthz检查、注册audit backend的PreShutdownHook等。
此部分的代码位于vendor/k8s.io/apiserver/pkg/server/genericapiserver.go
// PrepareRun does post API installation setup steps.
func (s *GenericAPIServer) PrepareRun() preparedGenericAPIServer {
if s.swaggerConfig != nil {
routes.Swagger{Config: s.swaggerConfig}.Install(s.Handler.GoRestfulContainer)
}
if s.openAPIConfig != nil {
routes.OpenAPI{
Config: s.openAPIConfig,
}.Install(s.Handler.GoRestfulContainer, s.Handler.NonGoRestfulMux)
}
s.installHealthz()
// Register audit backend preShutdownHook.
if s.AuditBackend != nil {
s.AddPreShutdownHook("audit-backend", func() error {
s.AuditBackend.Shutdown()
return nil
})
}
return preparedGenericAPIServer{s}
}
5.3. preparedGenericAPIServer.Run
preparedGenericAPIServer.Run运行一个安全的http server。具体的实现逻辑待后续文章分析。
此部分代码位于vendor/k8s.io/apiserver/pkg/server/genericapiserver.go
// Run spawns the secure http server. It only returns if stopCh is closed
// or the secure port cannot be listened on initially.
func (s preparedGenericAPIServer) Run(stopCh <-chan struct{}) error {
err := s.NonBlockingRun(stopCh)
if err != nil {
return err
}
<-stopCh
err = s.RunPreShutdownHooks()
if err != nil {
return err
}
// Wait for all requests to finish, which are bounded by the RequestTimeout variable.
s.HandlerChainWaitGroup.Wait()
return nil
}
核心函数:
err := s.NonBlockingRun(stopCh)
preparedGenericAPIServer.Run主要是调用NonBlockingRun函数,最终运行一个http server。该部分逻辑待后续文章分析。
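从上面的描述可以看出,"非阻塞启动HTTP服务 + 阻塞等待stopCh"是这类常驻server的基本模式。下面用一个与apiserver无关的最小示例演示该模式(示意代码):
import (
    "context"
    "log"
    "net/http"
)

// runServer非阻塞地启动http server,并在stopCh关闭后优雅退出
func runServer(stopCh <-chan struct{}) error {
    srv := &http.Server{Addr: ":8080"}
    go func() {
        if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
            log.Printf("listen error: %v", err)
        }
    }()
    // 阻塞直到stopCh被关闭
    <-stopCh
    return srv.Shutdown(context.Background())
}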
6. 总结
NewAPIServerCommand采用了Cobra命令行框架,该框架使用主要包含以下部分:
- 构造option参数,提供给执行主体(例如 本文的server)作为配置参数使用。
- 添加Flags,主要用来通过传入的flags参数最终解析成option中使用的结构体属性。
- 执行Run函数,执行主体的运行逻辑部分(核心部分)。
其中Run函数的主要内容如下:
- 构造一个聚合的server结构体。
- 执行PrepareRun。
- 最终执行preparedGenericAPIServer.Run。
preparedGenericAPIServer.Run主要是调用NonBlockingRun函数,最终运行一个http server。NonBlockingRun的具体逻辑待后续文章再单独分析。
参考:
- https://github.com/kubernetes/kubernetes/tree/v1.12.0/cmd/kube-apiserver
- https://github.com/kubernetes/kubernetes/blob/v1.12.0/cmd/kube-apiserver/app/server.go
- https://github.com/kubernetes/kubernetes/blob/v1.12.0/cmd/kube-apiserver/app/aggregator.go
- https://github.com/kubernetes/kubernetes/blob/v1.12.0/cmd/kube-apiserver/app/options/options.go
11.3 - kube-controller-manager
11.3.1 - kube-controller-manager源码分析(二)之 DeploymentController
以下代码分析基于kubernetes v1.12.0版本。
本文主要以deployment controller为例,分析该类controller的运行逻辑。此部分代码主要位于pkg/controller/deployment。pkg/controller部分的代码包括了各种类型的controller的具体实现。
controller manager的pkg部分代码目录结构如下:
controller # 主要包含各种controller的具体实现
├── apis
├── bootstrap
├── certificates
├── client_builder.go
├── cloud
├── clusterroleaggregation
├── controller_ref_manager.go
├── controller_utils.go # WaitForCacheSync
├── cronjob
├── daemon
├── deployment # deployment controller
│ ├── deployment_controller.go # NewDeploymentController、Run、syncDeployment
│ ├── progress.go # syncRolloutStatus
│ ├── recreate.go # rolloutRecreate
│ ├── rollback.go # rollback
│ ├── rolling.go # rolloutRolling
│ ├── sync.go
├── disruption # disruption controller
├── endpoint
├── garbagecollector
├── history
├── job
├── lookup_cache.go
├── namespace # namespace controller
├── nodeipam
├── nodelifecycle
├── podautoscaler
├── podgc
├── replicaset # replicaset controller
├── replication # replication controller
├── resourcequota
├── route
├── service # service controller
├── serviceaccount
├── statefulset # statefulset controller
└── volume # PersistentVolumeController、AttachDetachController、PVCProtectionController
1. startDeploymentController
func startDeploymentController(ctx ControllerContext) (http.Handler, bool, error) {
if !ctx.AvailableResources[schema.GroupVersionResource{Group: "apps", Version: "v1", Resource: "deployments"}] {
return nil, false, nil
}
dc, err := deployment.NewDeploymentController(
ctx.InformerFactory.Apps().V1().Deployments(),
ctx.InformerFactory.Apps().V1().ReplicaSets(),
ctx.InformerFactory.Core().V1().Pods(),
ctx.ClientBuilder.ClientOrDie("deployment-controller"),
)
if err != nil {
return nil, true, fmt.Errorf("error creating Deployment controller: %v", err)
}
go dc.Run(int(ctx.ComponentConfig.DeploymentController.ConcurrentDeploymentSyncs), ctx.Stop)
return nil, true, nil
}
startDeploymentController主要调用的函数为NewDeploymentController和对应的Run函数。该部分逻辑在kubernetes/pkg/controller中。
2. NewDeploymentController
NewDeploymentController主要构建DeploymentController结构体。
该部分主要处理了以下逻辑:
- 构建并运行事件处理器eventBroadcaster。
- 初始化赋值rsControl、clientset、workqueue。
- 添加dInformer、rsInformer、podInformer的ResourceEventHandlerFuncs,其中主要为AddFunc、UpdateFunc、DeleteFunc三类方法。
- 构造deployment、rs、pod的Informer的Lister函数和HasSynced函数。
- 调用syncHandler,来实现syncDeployment。
2.1. eventBroadcaster
调用事件处理器来记录deployment相关的事件。
eventBroadcaster := record.NewBroadcaster()
eventBroadcaster.StartLogging(glog.Infof)
// TODO: remove the wrapper when every clients have moved to use the clientset.
eventBroadcaster.StartRecordingToSink(&v1core.EventSinkImpl{Interface: v1core.New(client.CoreV1().RESTClient()).Events("")})
2.2. rsControl
构造DeploymentController,包括clientset、workqueue和rsControl。其中rsControl是具体实现rs逻辑的controller。
dc := &DeploymentController{
client: client,
eventRecorder: eventBroadcaster.NewRecorder(scheme.Scheme, v1.EventSource{Component: "deployment-controller"}),
queue: workqueue.NewNamedRateLimitingQueue(workqueue.DefaultControllerRateLimiter(), "deployment"),
}
dc.rsControl = controller.RealRSControl{
KubeClient: client,
Recorder: dc.eventRecorder,
}
2.3. Informer().AddEventHandler
添加dInformer、rsInformer、podInformer的ResourceEventHandlerFuncs,其中主要为AddFunc、UpdateFunc、DeleteFunc三类方法。
dInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
AddFunc: dc.addDeployment,
UpdateFunc: dc.updateDeployment,
// This will enter the sync loop and no-op, because the deployment has been deleted from the store.
DeleteFunc: dc.deleteDeployment,
})
rsInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
AddFunc: dc.addReplicaSet,
UpdateFunc: dc.updateReplicaSet,
DeleteFunc: dc.deleteReplicaSet,
})
podInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
DeleteFunc: dc.deletePod,
})
2.4. Informer.Lister()
调用dInformer、rsInformer和podInformer的Lister()方法。
dc.dLister = dInformer.Lister()
dc.rsLister = rsInformer.Lister()
dc.podLister = podInformer.Lister()
2.5. Informer().HasSynced
调用Informer().HasSynced,用于判断本地缓存是否已同步完成。
dc.dListerSynced = dInformer.Informer().HasSynced
dc.rsListerSynced = rsInformer.Informer().HasSynced
dc.podListerSynced = podInformer.Informer().HasSynced
2.6. syncHandler
syncHandler具体为syncDeployment,syncHandler负责deployment的同步实现。
dc.syncHandler = dc.syncDeployment
dc.enqueueDeployment = dc.enqueue
完整代码如下:
// NewDeploymentController creates a new DeploymentController.
func NewDeploymentController(dInformer extensionsinformers.DeploymentInformer, rsInformer extensionsinformers.ReplicaSetInformer, podInformer coreinformers.PodInformer, client clientset.Interface) (*DeploymentController, error) {
eventBroadcaster := record.NewBroadcaster()
eventBroadcaster.StartLogging(glog.Infof)
// TODO: remove the wrapper when every clients have moved to use the clientset.
eventBroadcaster.StartRecordingToSink(&v1core.EventSinkImpl{Interface: v1core.New(client.CoreV1().RESTClient()).Events("")})
if client != nil && client.CoreV1().RESTClient().GetRateLimiter() != nil {
if err := metrics.RegisterMetricAndTrackRateLimiterUsage("deployment_controller", client.CoreV1().RESTClient().GetRateLimiter()); err != nil {
return nil, err
}
}
dc := &DeploymentController{
client: client,
eventRecorder: eventBroadcaster.NewRecorder(scheme.Scheme, v1.EventSource{Component: "deployment-controller"}),
queue: workqueue.NewNamedRateLimitingQueue(workqueue.DefaultControllerRateLimiter(), "deployment"),
}
dc.rsControl = controller.RealRSControl{
KubeClient: client,
Recorder: dc.eventRecorder,
}
dInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
AddFunc: dc.addDeployment,
UpdateFunc: dc.updateDeployment,
// This will enter the sync loop and no-op, because the deployment has been deleted from the store.
DeleteFunc: dc.deleteDeployment,
})
rsInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
AddFunc: dc.addReplicaSet,
UpdateFunc: dc.updateReplicaSet,
DeleteFunc: dc.deleteReplicaSet,
})
podInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
DeleteFunc: dc.deletePod,
})
dc.syncHandler = dc.syncDeployment
dc.enqueueDeployment = dc.enqueue
dc.dLister = dInformer.Lister()
dc.rsLister = rsInformer.Lister()
dc.podLister = podInformer.Lister()
dc.dListerSynced = dInformer.Informer().HasSynced
dc.rsListerSynced = rsInformer.Informer().HasSynced
dc.podListerSynced = podInformer.Informer().HasSynced
return dc, nil
}
3. DeploymentController.Run
Run执行watch和sync的操作。
// Run begins watching and syncing.
func (dc *DeploymentController) Run(workers int, stopCh <-chan struct{}) {
defer utilruntime.HandleCrash()
defer dc.queue.ShutDown()
glog.Infof("Starting deployment controller")
defer glog.Infof("Shutting down deployment controller")
if !controller.WaitForCacheSync("deployment", stopCh, dc.dListerSynced, dc.rsListerSynced, dc.podListerSynced) {
return
}
for i := 0; i < workers; i++ {
go wait.Until(dc.worker, time.Second, stopCh)
}
<-stopCh
}
3.1. WaitForCacheSync
WaitForCacheSync用来等待List-Watch机制完成首次同步,保证本地cache的数据与etcd的数据一致后,再启动worker。
// WaitForCacheSync is a wrapper around cache.WaitForCacheSync that generates log messages
// indicating that the controller identified by controllerName is waiting for syncs, followed by
// either a successful or failed sync.
func WaitForCacheSync(controllerName string, stopCh <-chan struct{}, cacheSyncs ...cache.InformerSynced) bool {
glog.Infof("Waiting for caches to sync for %s controller", controllerName)
if !cache.WaitForCacheSync(stopCh, cacheSyncs...) {
utilruntime.HandleError(fmt.Errorf("Unable to sync caches for %s controller", controllerName))
return false
}
glog.Infof("Caches are synced for %s controller", controllerName)
return true
}
3.2. dc.worker
worker调用了processNextWorkItem,processNextWorkItem最终调用了syncHandler,而syncHandler在NewDeploymentController中赋值的具体函数为syncDeployment。
// worker runs a worker thread that just dequeues items, processes them, and marks them done.
// It enforces that the syncHandler is never invoked concurrently with the same key.
func (dc *DeploymentController) worker() {
for dc.processNextWorkItem() {
}
}
func (dc *DeploymentController) processNextWorkItem() bool {
key, quit := dc.queue.Get()
if quit {
return false
}
defer dc.queue.Done(key)
err := dc.syncHandler(key.(string))
dc.handleErr(err, key)
return true
}
NewDeploymentController中的syncHandler赋值:
func NewDeploymentController(dInformer appsinformers.DeploymentInformer, rsInformer appsinformers.ReplicaSetInformer, podInformer coreinformers.PodInformer, client clientset.Interface) (*DeploymentController, error) {
...
dc.syncHandler = dc.syncDeployment
...
}
4. syncDeployment
syncDeployment基于给定的key执行sync deployment的操作。
主要流程如下:
- 通过SplitMetaNamespaceKey获取namespace和deployment对象的name。
- 调用Lister接口获取deployment对象。
- getReplicaSetsForDeployment获取deployment管理的ReplicaSet对象。
- getPodMapForDeployment获取deployment管理的pod,基于ReplicaSet来分组。
- checkPausedConditions检查deployment是否是pause状态并添加合适的condition。
- isScalingEvent检查deployment的更新是否来自于一个scale的事件,如果是则执行scale的操作。
- 根据DeploymentStrategyType类型执行rolloutRecreate或rolloutRolling。
// syncDeployment will sync the deployment with the given key.
// This function is not meant to be invoked concurrently with the same key.
func (dc *DeploymentController) syncDeployment(key string) error {
startTime := time.Now()
glog.V(4).Infof("Started syncing deployment %q (%v)", key, startTime)
defer func() {
glog.V(4).Infof("Finished syncing deployment %q (%v)", key, time.Since(startTime))
}()
namespace, name, err := cache.SplitMetaNamespaceKey(key)
if err != nil {
return err
}
deployment, err := dc.dLister.Deployments(namespace).Get(name)
if errors.IsNotFound(err) {
glog.V(2).Infof("Deployment %v has been deleted", key)
return nil
}
if err != nil {
return err
}
// Deep-copy otherwise we are mutating our cache.
// TODO: Deep-copy only when needed.
d := deployment.DeepCopy()
everything := metav1.LabelSelector{}
if reflect.DeepEqual(d.Spec.Selector, &everything) {
dc.eventRecorder.Eventf(d, v1.EventTypeWarning, "SelectingAll", "This deployment is selecting all pods. A non-empty selector is required.")
if d.Status.ObservedGeneration < d.Generation {
d.Status.ObservedGeneration = d.Generation
dc.client.ExtensionsV1beta1().Deployments(d.Namespace).UpdateStatus(d)
}
return nil
}
// List ReplicaSets owned by this Deployment, while reconciling ControllerRef
// through adoption/orphaning.
rsList, err := dc.getReplicaSetsForDeployment(d)
if err != nil {
return err
}
// List all Pods owned by this Deployment, grouped by their ReplicaSet.
// Current uses of the podMap are:
//
// * check if a Pod is labeled correctly with the pod-template-hash label.
// * check that no old Pods are running in the middle of Recreate Deployments.
podMap, err := dc.getPodMapForDeployment(d, rsList)
if err != nil {
return err
}
if d.DeletionTimestamp != nil {
return dc.syncStatusOnly(d, rsList, podMap)
}
// Update deployment conditions with an Unknown condition when pausing/resuming
// a deployment. In this way, we can be sure that we won't timeout when a user
// resumes a Deployment with a set progressDeadlineSeconds.
if err = dc.checkPausedConditions(d); err != nil {
return err
}
if d.Spec.Paused {
return dc.sync(d, rsList, podMap)
}
// rollback is not re-entrant in case the underlying replica sets are updated with a new
// revision so we should ensure that we won't proceed to update replica sets until we
// make sure that the deployment has cleaned up its rollback spec in subsequent enqueues.
if d.Spec.RollbackTo != nil {
return dc.rollback(d, rsList, podMap)
}
scalingEvent, err := dc.isScalingEvent(d, rsList, podMap)
if err != nil {
return err
}
if scalingEvent {
return dc.sync(d, rsList, podMap)
}
switch d.Spec.Strategy.Type {
case extensions.RecreateDeploymentStrategyType:
return dc.rolloutRecreate(d, rsList, podMap)
case extensions.RollingUpdateDeploymentStrategyType:
return dc.rolloutRolling(d, rsList, podMap)
}
return fmt.Errorf("unexpected deployment strategy type: %s", d.Spec.Strategy.Type)
}
4.1. Get deployment
// get namespace and deployment name
namespace, name, err := cache.SplitMetaNamespaceKey(key)
// get deployment by name
deployment, err := dc.dLister.Deployments(namespace).Get(name)
4.2. getReplicaSetsForDeployment
// List ReplicaSets owned by this Deployment, while reconciling ControllerRef
// through adoption/orphaning.
rsList, err := dc.getReplicaSetsForDeployment(d)
getReplicaSetsForDeployment具体代码:
// getReplicaSetsForDeployment uses ControllerRefManager to reconcile
// ControllerRef by adopting and orphaning.
// It returns the list of ReplicaSets that this Deployment should manage.
func (dc *DeploymentController) getReplicaSetsForDeployment(d *apps.Deployment) ([]*apps.ReplicaSet, error) {
// List all ReplicaSets to find those we own but that no longer match our
// selector. They will be orphaned by ClaimReplicaSets().
rsList, err := dc.rsLister.ReplicaSets(d.Namespace).List(labels.Everything())
if err != nil {
return nil, err
}
deploymentSelector, err := metav1.LabelSelectorAsSelector(d.Spec.Selector)
if err != nil {
return nil, fmt.Errorf("deployment %s/%s has invalid label selector: %v", d.Namespace, d.Name, err)
}
// If any adoptions are attempted, we should first recheck for deletion with
// an uncached quorum read sometime after listing ReplicaSets (see #42639).
canAdoptFunc := controller.RecheckDeletionTimestamp(func() (metav1.Object, error) {
fresh, err := dc.client.AppsV1().Deployments(d.Namespace).Get(d.Name, metav1.GetOptions{})
if err != nil {
return nil, err
}
if fresh.UID != d.UID {
return nil, fmt.Errorf("original Deployment %v/%v is gone: got uid %v, wanted %v", d.Namespace, d.Name, fresh.UID, d.UID)
}
return fresh, nil
})
cm := controller.NewReplicaSetControllerRefManager(dc.rsControl, d, deploymentSelector, controllerKind, canAdoptFunc)
return cm.ClaimReplicaSets(rsList)
}
4.3. getPodMapForDeployment
// List all Pods owned by this Deployment, grouped by their ReplicaSet.
// Current uses of the podMap are:
//
// * check if a Pod is labeled correctly with the pod-template-hash label.
// * check that no old Pods are running in the middle of Recreate Deployments.
podMap, err := dc.getPodMapForDeployment(d, rsList)
getPodMapForDeployment具体代码:
// getPodMapForDeployment returns the Pods managed by a Deployment.
//
// It returns a map from ReplicaSet UID to a list of Pods controlled by that RS,
// according to the Pod's ControllerRef.
func (dc *DeploymentController) getPodMapForDeployment(d *apps.Deployment, rsList []*apps.ReplicaSet) (map[types.UID]*v1.PodList, error) {
// Get all Pods that potentially belong to this Deployment.
selector, err := metav1.LabelSelectorAsSelector(d.Spec.Selector)
if err != nil {
return nil, err
}
pods, err := dc.podLister.Pods(d.Namespace).List(selector)
if err != nil {
return nil, err
}
// Group Pods by their controller (if it's in rsList).
podMap := make(map[types.UID]*v1.PodList, len(rsList))
for _, rs := range rsList {
podMap[rs.UID] = &v1.PodList{}
}
for _, pod := range pods {
// Do not ignore inactive Pods because Recreate Deployments need to verify that no
// Pods from older versions are running before spinning up new Pods.
controllerRef := metav1.GetControllerOf(pod)
if controllerRef == nil {
continue
}
// Only append if we care about this UID.
if podList, ok := podMap[controllerRef.UID]; ok {
podList.Items = append(podList.Items, *pod)
}
}
return podMap, nil
}
4.4. checkPausedConditions
// Update deployment conditions with an Unknown condition when pausing/resuming
// a deployment. In this way, we can be sure that we won't timeout when a user
// resumes a Deployment with a set progressDeadlineSeconds.
if err = dc.checkPausedConditions(d); err != nil {
return err
}
if d.Spec.Paused {
return dc.sync(d, rsList)
}
checkPausedConditions具体代码:
// checkPausedConditions checks if the given deployment is paused or not and adds an appropriate condition.
// These conditions are needed so that we won't accidentally report lack of progress for resumed deployments
// that were paused for longer than progressDeadlineSeconds.
func (dc *DeploymentController) checkPausedConditions(d *apps.Deployment) error {
if !deploymentutil.HasProgressDeadline(d) {
return nil
}
cond := deploymentutil.GetDeploymentCondition(d.Status, apps.DeploymentProgressing)
if cond != nil && cond.Reason == deploymentutil.TimedOutReason {
// If we have reported lack of progress, do not overwrite it with a paused condition.
return nil
}
pausedCondExists := cond != nil && cond.Reason == deploymentutil.PausedDeployReason
needsUpdate := false
if d.Spec.Paused && !pausedCondExists {
condition := deploymentutil.NewDeploymentCondition(apps.DeploymentProgressing, v1.ConditionUnknown, deploymentutil.PausedDeployReason, "Deployment is paused")
deploymentutil.SetDeploymentCondition(&d.Status, *condition)
needsUpdate = true
} else if !d.Spec.Paused && pausedCondExists {
condition := deploymentutil.NewDeploymentCondition(apps.DeploymentProgressing, v1.ConditionUnknown, deploymentutil.ResumedDeployReason, "Deployment is resumed")
deploymentutil.SetDeploymentCondition(&d.Status, *condition)
needsUpdate = true
}
if !needsUpdate {
return nil
}
var err error
d, err = dc.client.AppsV1().Deployments(d.Namespace).UpdateStatus(d)
return err
}
4.5. isScalingEvent
scalingEvent, err := dc.isScalingEvent(d, rsList)
if err != nil {
return err
}
if scalingEvent {
return dc.sync(d, rsList)
}
isScalingEvent具体代码:
// isScalingEvent checks whether the provided deployment has been updated with a scaling event
// by looking at the desired-replicas annotation in the active replica sets of the deployment.
//
// rsList should come from getReplicaSetsForDeployment(d).
// podMap should come from getPodMapForDeployment(d, rsList).
func (dc *DeploymentController) isScalingEvent(d *apps.Deployment, rsList []*apps.ReplicaSet) (bool, error) {
newRS, oldRSs, err := dc.getAllReplicaSetsAndSyncRevision(d, rsList, false)
if err != nil {
return false, err
}
allRSs := append(oldRSs, newRS)
for _, rs := range controller.FilterActiveReplicaSets(allRSs) {
desired, ok := deploymentutil.GetDesiredReplicasAnnotation(rs)
if !ok {
continue
}
if desired != *(d.Spec.Replicas) {
return true, nil
}
}
return false, nil
}
4.6. rolloutRecreate
switch d.Spec.Strategy.Type {
case apps.RecreateDeploymentStrategyType:
return dc.rolloutRecreate(d, rsList, podMap)
rolloutRecreate具体代码:
// rolloutRecreate implements the logic for recreating a replica set.
func (dc *DeploymentController) rolloutRecreate(d *apps.Deployment, rsList []*apps.ReplicaSet, podMap map[types.UID]*v1.PodList) error {
// Don't create a new RS if not already existed, so that we avoid scaling up before scaling down.
newRS, oldRSs, err := dc.getAllReplicaSetsAndSyncRevision(d, rsList, false)
if err != nil {
return err
}
allRSs := append(oldRSs, newRS)
activeOldRSs := controller.FilterActiveReplicaSets(oldRSs)
// scale down old replica sets.
scaledDown, err := dc.scaleDownOldReplicaSetsForRecreate(activeOldRSs, d)
if err != nil {
return err
}
if scaledDown {
// Update DeploymentStatus.
return dc.syncRolloutStatus(allRSs, newRS, d)
}
// Do not process a deployment when it has old pods running.
if oldPodsRunning(newRS, oldRSs, podMap) {
return dc.syncRolloutStatus(allRSs, newRS, d)
}
// If we need to create a new RS, create it now.
if newRS == nil {
newRS, oldRSs, err = dc.getAllReplicaSetsAndSyncRevision(d, rsList, true)
if err != nil {
return err
}
allRSs = append(oldRSs, newRS)
}
// scale up new replica set.
if _, err := dc.scaleUpNewReplicaSetForRecreate(newRS, d); err != nil {
return err
}
if util.DeploymentComplete(d, &d.Status) {
if err := dc.cleanupDeployment(oldRSs, d); err != nil {
return err
}
}
// Sync deployment status.
return dc.syncRolloutStatus(allRSs, newRS, d)
}
4.7. rolloutRolling
switch d.Spec.Strategy.Type {
case apps.RecreateDeploymentStrategyType:
return dc.rolloutRecreate(d, rsList, podMap)
case apps.RollingUpdateDeploymentStrategyType:
return dc.rolloutRolling(d, rsList)
}
rolloutRolling具体代码:
// rolloutRolling implements the logic for rolling a new replica set.
func (dc *DeploymentController) rolloutRolling(d *apps.Deployment, rsList []*apps.ReplicaSet) error {
newRS, oldRSs, err := dc.getAllReplicaSetsAndSyncRevision(d, rsList, true)
if err != nil {
return err
}
allRSs := append(oldRSs, newRS)
// Scale up, if we can.
scaledUp, err := dc.reconcileNewReplicaSet(allRSs, newRS, d)
if err != nil {
return err
}
if scaledUp {
// Update DeploymentStatus
return dc.syncRolloutStatus(allRSs, newRS, d)
}
// Scale down, if we can.
scaledDown, err := dc.reconcileOldReplicaSets(allRSs, controller.FilterActiveReplicaSets(oldRSs), newRS, d)
if err != nil {
return err
}
if scaledDown {
// Update DeploymentStatus
return dc.syncRolloutStatus(allRSs, newRS, d)
}
if deploymentutil.DeploymentComplete(d, &d.Status) {
if err := dc.cleanupDeployment(oldRSs, d); err != nil {
return err
}
}
// Sync deployment status
return dc.syncRolloutStatus(allRSs, newRS, d)
}
5. 总结
startDeploymentController主要包括NewDeploymentController和DeploymentController.Run两部分。
NewDeploymentController主要构建DeploymentController结构体。
该部分主要处理了以下逻辑:
- 构建并运行事件处理器eventBroadcaster。
- 初始化赋值rsControl、clientset、workqueue。
- 添加dInformer、rsInformer、podInformer的ResourceEventHandlerFuncs,其中主要为AddFunc、UpdateFunc、DeleteFunc三类方法。
- 构造deployment、rs、pod的Informer的Lister函数和HasSynced函数。
- 赋值syncHandler,来实现syncDeployment。
DeploymentController.Run主要包含WaitForCacheSync和syncDeployment两部分。
syncDeployment基于给定的key执行sync deployment的操作。
主要流程如下:
- 通过SplitMetaNamespaceKey获取namespace和deployment对象的name。
- 调用Lister的接口获取deployment对象。
- getReplicaSetsForDeployment获取deployment管理的ReplicaSet对象。
- getPodMapForDeployment获取deployment管理的pod,基于ReplicaSet来分组。
- checkPausedConditions检查deployment是否处于pause状态,并添加合适的condition。
- isScalingEvent检查deployment的更新是否来自一个scale事件,如果是则执行scale操作。
- 根据DeploymentStrategyType类型执行rolloutRecreate或rolloutRolling。
11.3.2 - kube-controller-manager源码分析(一)之 NewControllerManagerCommand
以下代码分析基于kubernetes v1.12.0版本。本文主要分析 https://github.com/kubernetes/kubernetes/tree/v1.12.0/cmd/kube-controller-manager 部分的代码。
本文主要分析 kubernetes/cmd/kube-controller-manager 部分,该部分主要涉及各种类型controller的参数解析及初始化,例如 deployment controller 和 statefulset controller,并不包含具体controller运行的详细逻辑;该部分逻辑位于kubernetes/pkg/controller模块,待后续文章分析。
kube-controller-manager的cmd部分代码目录结构如下:
kube-controller-manager
├── app
│ ├── apps.go # 包含:startDeploymentController、startReplicaSetController、startStatefulSetController、startDaemonSetController
│ ├── autoscaling.go # startHPAController
│ ├── batch.go # startJobController、startCronJobController
│ ├── bootstrap.go
│ ├── certificates.go
│ ├── cloudproviders.go
│ ├── config
│ │ └── config.go # config: controller manager执行的上下文
│ ├── controllermanager.go # 包含:NewControllerManagerCommand、Run、NewControllerInitializers、StartControllers等
│ ├── core.go # startServiceController、startNodeIpamController、startPersistentVolumeBinderController、startNamespaceController等
│ ├── options # 包含不同controller的option参数
│ │ ├── attachdetachcontroller.go
│ │ ├── csrsigningcontroller.go
│ │ ├── daemonsetcontroller.go # DaemonSetControllerOptions
│ │ ├── deploymentcontroller.go # DeploymentControllerOptions
│ │ ├── deprecatedcontroller.go
│ │ ├── endpointcontroller.go
│ │ ├── garbagecollectorcontroller.go
│ │ ├── hpacontroller.go
│ │ ├── jobcontroller.go
│ │ ├── namespacecontroller.go # NamespaceControllerOptions
│ │ ├── nodeipamcontroller.go
│ │ ├── nodelifecyclecontroller.go
│ │ ├── options.go # KubeControllerManagerOptions、NewKubeControllerManagerOptions
│ │ ├── persistentvolumebindercontroller.go
│ │ ├── podgccontroller.go
│ │ ├── replicasetcontroller.go # ReplicaSetControllerOptions
│ │ ├── replicationcontroller.go
│ │ ├── resourcequotacontroller.go
│ │ ├── serviceaccountcontroller.go
│ │ └── ttlafterfinishedcontroller.go
└── controller-manager.go # main入口函数
1. Main函数
kube-controller-manager的入口Main函数仍然采用统一的代码风格,使用Cobra命令行框架。
func main() {
rand.Seed(time.Now().UTC().UnixNano())
command := app.NewControllerManagerCommand()
// TODO: once we switch everything over to Cobra commands, we can go back to calling
// utilflag.InitFlags() (by removing its pflag.Parse() call). For now, we have to set the
// normalize func and add the go flag set by hand.
pflag.CommandLine.SetNormalizeFunc(utilflag.WordSepNormalizeFunc)
pflag.CommandLine.AddGoFlagSet(goflag.CommandLine)
// utilflag.InitFlags()
logs.InitLogs()
defer logs.FlushLogs()
if err := command.Execute(); err != nil {
fmt.Fprintf(os.Stderr, "%v\n", err)
os.Exit(1)
}
}
核心代码:
// 初始化命令行结构体
command := app.NewControllerManagerCommand()
// 执行Execute
err := command.Execute()
2. NewControllerManagerCommand
该部分代码位于:kubernetes/cmd/kube-controller-manager/app/controllermanager.go
// NewControllerManagerCommand creates a *cobra.Command object with default parameters
func NewControllerManagerCommand() *cobra.Command {
...
cmd := &cobra.Command{
Use: "kube-controller-manager",
Long: `The Kubernetes controller manager is a daemon that embeds
the core control loops shipped with Kubernetes. In applications of robotics and
automation, a control loop is a non-terminating loop that regulates the state of
the system. In Kubernetes, a controller is a control loop that watches the shared
state of the cluster through the apiserver and makes changes attempting to move the
current state towards the desired state. Examples of controllers that ship with
Kubernetes today are the replication controller, endpoints controller, namespace
controller, and serviceaccounts controller.`,
Run: func(cmd *cobra.Command, args []string) {
verflag.PrintAndExitIfRequested()
utilflag.PrintFlags(cmd.Flags())
c, err := s.Config(KnownControllers(), ControllersDisabledByDefault.List())
if err != nil {
fmt.Fprintf(os.Stderr, "%v\n", err)
os.Exit(1)
}
if err := Run(c.Complete(), wait.NeverStop); err != nil {
fmt.Fprintf(os.Stderr, "%v\n", err)
os.Exit(1)
}
},
}
...
}
构建一个*cobra.Command对象,然后执行Run函数。
2.1. NewKubeControllerManagerOptions
s, err := options.NewKubeControllerManagerOptions()
if err != nil {
glog.Fatalf("unable to initialize command options: %v", err)
}
初始化controllerManager的参数,其中主要包括了各种controller的option,例如DeploymentControllerOptions:
// DeploymentControllerOptions holds the DeploymentController options.
type DeploymentControllerOptions struct {
ConcurrentDeploymentSyncs int32
DeploymentControllerSyncPeriod metav1.Duration
}
具体代码如下:
// NewKubeControllerManagerOptions creates a new KubeControllerManagerOptions with a default config.
func NewKubeControllerManagerOptions() (*KubeControllerManagerOptions, error) {
componentConfig, err := NewDefaultComponentConfig(ports.InsecureKubeControllerManagerPort)
if err != nil {
return nil, err
}
s := KubeControllerManagerOptions{
Generic: cmoptions.NewGenericControllerManagerConfigurationOptions(componentConfig.Generic),
KubeCloudShared: cmoptions.NewKubeCloudSharedOptions(componentConfig.KubeCloudShared),
AttachDetachController: &AttachDetachControllerOptions{
ReconcilerSyncLoopPeriod: componentConfig.AttachDetachController.ReconcilerSyncLoopPeriod,
},
CSRSigningController: &CSRSigningControllerOptions{
ClusterSigningCertFile: componentConfig.CSRSigningController.ClusterSigningCertFile,
ClusterSigningKeyFile: componentConfig.CSRSigningController.ClusterSigningKeyFile,
ClusterSigningDuration: componentConfig.CSRSigningController.ClusterSigningDuration,
},
DaemonSetController: &DaemonSetControllerOptions{
ConcurrentDaemonSetSyncs: componentConfig.DaemonSetController.ConcurrentDaemonSetSyncs,
},
DeploymentController: &DeploymentControllerOptions{
ConcurrentDeploymentSyncs: componentConfig.DeploymentController.ConcurrentDeploymentSyncs,
DeploymentControllerSyncPeriod: componentConfig.DeploymentController.DeploymentControllerSyncPeriod,
},
DeprecatedFlags: &DeprecatedControllerOptions{
RegisterRetryCount: componentConfig.DeprecatedController.RegisterRetryCount,
},
EndpointController: &EndpointControllerOptions{
ConcurrentEndpointSyncs: componentConfig.EndpointController.ConcurrentEndpointSyncs,
},
GarbageCollectorController: &GarbageCollectorControllerOptions{
ConcurrentGCSyncs: componentConfig.GarbageCollectorController.ConcurrentGCSyncs,
EnableGarbageCollector: componentConfig.GarbageCollectorController.EnableGarbageCollector,
},
HPAController: &HPAControllerOptions{
HorizontalPodAutoscalerSyncPeriod: componentConfig.HPAController.HorizontalPodAutoscalerSyncPeriod,
HorizontalPodAutoscalerUpscaleForbiddenWindow: componentConfig.HPAController.HorizontalPodAutoscalerUpscaleForbiddenWindow,
HorizontalPodAutoscalerDownscaleForbiddenWindow: componentConfig.HPAController.HorizontalPodAutoscalerDownscaleForbiddenWindow,
HorizontalPodAutoscalerDownscaleStabilizationWindow: componentConfig.HPAController.HorizontalPodAutoscalerDownscaleStabilizationWindow,
HorizontalPodAutoscalerCPUInitializationPeriod: componentConfig.HPAController.HorizontalPodAutoscalerCPUInitializationPeriod,
HorizontalPodAutoscalerInitialReadinessDelay: componentConfig.HPAController.HorizontalPodAutoscalerInitialReadinessDelay,
HorizontalPodAutoscalerTolerance: componentConfig.HPAController.HorizontalPodAutoscalerTolerance,
HorizontalPodAutoscalerUseRESTClients: componentConfig.HPAController.HorizontalPodAutoscalerUseRESTClients,
},
JobController: &JobControllerOptions{
ConcurrentJobSyncs: componentConfig.JobController.ConcurrentJobSyncs,
},
NamespaceController: &NamespaceControllerOptions{
NamespaceSyncPeriod: componentConfig.NamespaceController.NamespaceSyncPeriod,
ConcurrentNamespaceSyncs: componentConfig.NamespaceController.ConcurrentNamespaceSyncs,
},
NodeIPAMController: &NodeIPAMControllerOptions{
NodeCIDRMaskSize: componentConfig.NodeIPAMController.NodeCIDRMaskSize,
},
NodeLifecycleController: &NodeLifecycleControllerOptions{
EnableTaintManager: componentConfig.NodeLifecycleController.EnableTaintManager,
NodeMonitorGracePeriod: componentConfig.NodeLifecycleController.NodeMonitorGracePeriod,
NodeStartupGracePeriod: componentConfig.NodeLifecycleController.NodeStartupGracePeriod,
PodEvictionTimeout: componentConfig.NodeLifecycleController.PodEvictionTimeout,
},
PersistentVolumeBinderController: &PersistentVolumeBinderControllerOptions{
PVClaimBinderSyncPeriod: componentConfig.PersistentVolumeBinderController.PVClaimBinderSyncPeriod,
VolumeConfiguration: componentConfig.PersistentVolumeBinderController.VolumeConfiguration,
},
PodGCController: &PodGCControllerOptions{
TerminatedPodGCThreshold: componentConfig.PodGCController.TerminatedPodGCThreshold,
},
ReplicaSetController: &ReplicaSetControllerOptions{
ConcurrentRSSyncs: componentConfig.ReplicaSetController.ConcurrentRSSyncs,
},
ReplicationController: &ReplicationControllerOptions{
ConcurrentRCSyncs: componentConfig.ReplicationController.ConcurrentRCSyncs,
},
ResourceQuotaController: &ResourceQuotaControllerOptions{
ResourceQuotaSyncPeriod: componentConfig.ResourceQuotaController.ResourceQuotaSyncPeriod,
ConcurrentResourceQuotaSyncs: componentConfig.ResourceQuotaController.ConcurrentResourceQuotaSyncs,
},
SAController: &SAControllerOptions{
ConcurrentSATokenSyncs: componentConfig.SAController.ConcurrentSATokenSyncs,
},
ServiceController: &cmoptions.ServiceControllerOptions{
ConcurrentServiceSyncs: componentConfig.ServiceController.ConcurrentServiceSyncs,
},
TTLAfterFinishedController: &TTLAfterFinishedControllerOptions{
ConcurrentTTLSyncs: componentConfig.TTLAfterFinishedController.ConcurrentTTLSyncs,
},
SecureServing: apiserveroptions.NewSecureServingOptions().WithLoopback(),
InsecureServing: (&apiserveroptions.DeprecatedInsecureServingOptions{
BindAddress: net.ParseIP(componentConfig.Generic.Address),
BindPort: int(componentConfig.Generic.Port),
BindNetwork: "tcp",
}).WithLoopback(),
Authentication: apiserveroptions.NewDelegatingAuthenticationOptions(),
Authorization: apiserveroptions.NewDelegatingAuthorizationOptions(),
}
s.Authentication.RemoteKubeConfigFileOptional = true
s.Authorization.RemoteKubeConfigFileOptional = true
s.Authorization.AlwaysAllowPaths = []string{"/healthz"}
s.SecureServing.ServerCert.CertDirectory = "/var/run/kubernetes"
s.SecureServing.ServerCert.PairName = "kube-controller-manager"
s.SecureServing.BindPort = ports.KubeControllerManagerPort
gcIgnoredResources := make([]kubectrlmgrconfig.GroupResource, 0, len(garbagecollector.DefaultIgnoredResources()))
for r := range garbagecollector.DefaultIgnoredResources() {
gcIgnoredResources = append(gcIgnoredResources, kubectrlmgrconfig.GroupResource{Group: r.Group, Resource: r.Resource})
}
s.GarbageCollectorController.GCIgnoredResources = gcIgnoredResources
return &s, nil
}
2.2. AddFlagSet
添加参数及帮助函数。
fs := cmd.Flags()
namedFlagSets := s.Flags(KnownControllers(), ControllersDisabledByDefault.List())
for _, f := range namedFlagSets.FlagSets {
fs.AddFlagSet(f)
}
usageFmt := "Usage:\n %s\n"
cols, _, _ := apiserverflag.TerminalSize(cmd.OutOrStdout())
cmd.SetUsageFunc(func(cmd *cobra.Command) error {
fmt.Fprintf(cmd.OutOrStderr(), usageFmt, cmd.UseLine())
apiserverflag.PrintSections(cmd.OutOrStderr(), namedFlagSets, cols)
return nil
})
cmd.SetHelpFunc(func(cmd *cobra.Command, args []string) {
fmt.Fprintf(cmd.OutOrStdout(), "%s\n\n"+usageFmt, cmd.Long, cmd.UseLine())
apiserverflag.PrintSections(cmd.OutOrStdout(), namedFlagSets, cols)
})
3. Run
此部分的代码位于cmd/kube-controller-manager/app/controllermanager.go
基于KubeControllerManagerOptions运行controllerManager,不退出。
// Run runs the KubeControllerManagerOptions. This should never exit.
func Run(c *config.CompletedConfig, stopCh <-chan struct{}) error {
...
run := func(ctx context.Context) {
...
controllerContext, err := CreateControllerContext(c, rootClientBuilder, clientBuilder, ctx.Done())
if err != nil {
glog.Fatalf("error building controller context: %v", err)
}
saTokenControllerInitFunc := serviceAccountTokenControllerStarter{rootClientBuilder: rootClientBuilder}.startServiceAccountTokenController
if err := StartControllers(controllerContext, saTokenControllerInitFunc, NewControllerInitializers(controllerContext.LoopMode), unsecuredMux); err != nil {
glog.Fatalf("error starting controllers: %v", err)
}
controllerContext.InformerFactory.Start(controllerContext.Stop)
close(controllerContext.InformersStarted)
select {}
}
...
}
Run函数涉及的核心代码如下:
// 创建controller的context
controllerContext, err := CreateControllerContext(c, rootClientBuilder, clientBuilder, ctx.Done())
// 启动各种controller
err := StartControllers(controllerContext, saTokenControllerInitFunc, NewControllerInitializers(controllerContext.LoopMode), unsecuredMux)
其中StartControllers中的入参NewControllerInitializers初始化了各种controller。
3.1. CreateControllerContext
CreateControllerContext构建了各种controller所需的资源的上下文,各种controller在启动时,入参为该context,具体参考initFn(ctx)。
// CreateControllerContext creates a context struct containing references to resources needed by the
// controllers such as the cloud provider and clientBuilder. rootClientBuilder is only used for
// the shared-informers client and token controller.
func CreateControllerContext(s *config.CompletedConfig, rootClientBuilder, clientBuilder controller.ControllerClientBuilder, stop <-chan struct{}) (ControllerContext, error) {
versionedClient := rootClientBuilder.ClientOrDie("shared-informers")
sharedInformers := informers.NewSharedInformerFactory(versionedClient, ResyncPeriod(s)())
// If apiserver is not running we should wait for some time and fail only then. This is particularly
// important when we start apiserver and controller manager at the same time.
if err := genericcontrollermanager.WaitForAPIServer(versionedClient, 10*time.Second); err != nil {
return ControllerContext{}, fmt.Errorf("failed to wait for apiserver being healthy: %v", err)
}
// Use a discovery client capable of being refreshed.
discoveryClient := rootClientBuilder.ClientOrDie("controller-discovery")
cachedClient := cacheddiscovery.NewMemCacheClient(discoveryClient.Discovery())
restMapper := restmapper.NewDeferredDiscoveryRESTMapper(cachedClient)
go wait.Until(func() {
restMapper.Reset()
}, 30*time.Second, stop)
availableResources, err := GetAvailableResources(rootClientBuilder)
if err != nil {
return ControllerContext{}, err
}
cloud, loopMode, err := createCloudProvider(s.ComponentConfig.KubeCloudShared.CloudProvider.Name, s.ComponentConfig.KubeCloudShared.ExternalCloudVolumePlugin,
s.ComponentConfig.KubeCloudShared.CloudProvider.CloudConfigFile, s.ComponentConfig.KubeCloudShared.AllowUntaggedCloud, sharedInformers)
if err != nil {
return ControllerContext{}, err
}
ctx := ControllerContext{
ClientBuilder: clientBuilder,
InformerFactory: sharedInformers,
ComponentConfig: s.ComponentConfig,
RESTMapper: restMapper,
AvailableResources: availableResources,
Cloud: cloud,
LoopMode: loopMode,
Stop: stop,
InformersStarted: make(chan struct{}),
ResyncPeriod: ResyncPeriod(s),
}
return ctx, nil
}
核心代码为NewSharedInformerFactory。
// 创建SharedInformerFactory
sharedInformers := informers.NewSharedInformerFactory(versionedClient, ResyncPeriod(s)())
// 赋值给ControllerContext
ctx := ControllerContext{
InformerFactory: sharedInformers,
}
SharedInformerFactory提供了公共的k8s对象的informers。
// SharedInformerFactory provides shared informers for resources in all known
// API group versions.
type SharedInformerFactory interface {
internalinterfaces.SharedInformerFactory
ForResource(resource schema.GroupVersionResource) (GenericInformer, error)
WaitForCacheSync(stopCh <-chan struct{}) map[reflect.Type]bool
Admissionregistration() admissionregistration.Interface
Apps() apps.Interface
Autoscaling() autoscaling.Interface
Batch() batch.Interface
Certificates() certificates.Interface
Coordination() coordination.Interface
Core() core.Interface
Events() events.Interface
Extensions() extensions.Interface
Networking() networking.Interface
Policy() policy.Interface
Rbac() rbac.Interface
Scheduling() scheduling.Interface
Settings() settings.Interface
Storage() storage.Interface
}
3.2. NewControllerInitializers
NewControllerInitializers定义了各种controller的类型和其对应的启动函数,例如deployment、statefulset、replicaset、replicationcontroller、namespace等。
// NewControllerInitializers is a public map of named controller groups (you can start more than one in an init func)
// paired to their InitFunc. This allows for structured downstream composition and subdivision.
func NewControllerInitializers(loopMode ControllerLoopMode) map[string]InitFunc {
controllers := map[string]InitFunc{}
controllers["endpoint"] = startEndpointController
controllers["replicationcontroller"] = startReplicationController
controllers["podgc"] = startPodGCController
controllers["resourcequota"] = startResourceQuotaController
controllers["namespace"] = startNamespaceController
controllers["serviceaccount"] = startServiceAccountController
controllers["garbagecollector"] = startGarbageCollectorController
controllers["daemonset"] = startDaemonSetController
controllers["job"] = startJobController
controllers["deployment"] = startDeploymentController
controllers["replicaset"] = startReplicaSetController
controllers["horizontalpodautoscaling"] = startHPAController
controllers["disruption"] = startDisruptionController
controllers["statefulset"] = startStatefulSetController
controllers["cronjob"] = startCronJobController
controllers["csrsigning"] = startCSRSigningController
controllers["csrapproving"] = startCSRApprovingController
controllers["csrcleaner"] = startCSRCleanerController
controllers["ttl"] = startTTLController
controllers["bootstrapsigner"] = startBootstrapSignerController
controllers["tokencleaner"] = startTokenCleanerController
controllers["nodeipam"] = startNodeIpamController
if loopMode == IncludeCloudLoops {
controllers["service"] = startServiceController
controllers["route"] = startRouteController
// TODO: volume controller into the IncludeCloudLoops only set.
// TODO: Separate cluster in cloud check from node lifecycle controller.
}
controllers["nodelifecycle"] = startNodeLifecycleController
controllers["persistentvolume-binder"] = startPersistentVolumeBinderController
controllers["attachdetach"] = startAttachDetachController
controllers["persistentvolume-expander"] = startVolumeExpandController
controllers["clusterrole-aggregation"] = startClusterRoleAggregrationController
controllers["pvc-protection"] = startPVCProtectionController
controllers["pv-protection"] = startPVProtectionController
controllers["ttl-after-finished"] = startTTLAfterFinishedController
return controllers
}
3.3. StartControllers
func StartControllers(ctx ControllerContext, startSATokenController InitFunc, controllers map[string]InitFunc, unsecuredMux *mux.PathRecorderMux) error {
...
for controllerName, initFn := range controllers {
if !ctx.IsControllerEnabled(controllerName) {
glog.Warningf("%q is disabled", controllerName)
continue
}
time.Sleep(wait.Jitter(ctx.ComponentConfig.Generic.ControllerStartInterval.Duration, ControllerStartJitter))
glog.V(1).Infof("Starting %q", controllerName)
debugHandler, started, err := initFn(ctx)
if err != nil {
glog.Errorf("Error starting %q", controllerName)
return err
}
if !started {
glog.Warningf("Skipping %q", controllerName)
continue
}
if debugHandler != nil && unsecuredMux != nil {
basePath := "/debug/controllers/" + controllerName
unsecuredMux.UnlistedHandle(basePath, http.StripPrefix(basePath, debugHandler))
unsecuredMux.UnlistedHandlePrefix(basePath+"/", http.StripPrefix(basePath, debugHandler))
}
glog.Infof("Started %q", controllerName)
}
return nil
}
核心代码:
for controllerName, initFn := range controllers {
debugHandler, started, err := initFn(ctx)
}
启动各种controller,controller的启动函数在NewControllerInitializers中定义了,例如:
// deployment
controllers["deployment"] = startDeploymentController
// statefulset
controllers["statefulset"] = startStatefulSetController
3.4. InformerFactory.Start
InformerFactory实际上是SharedInformerFactory,具体的实现逻辑位于client-go中的informer实现。
controllerContext.InformerFactory.Start(controllerContext.Stop)
close(controllerContext.InformersStarted)
3.4.1. SharedInformerFactory
SharedInformerFactory是一个informer工厂的接口定义。
// SharedInformerFactory a small interface to allow for adding an informer without an import cycle
type SharedInformerFactory interface {
Start(stopCh <-chan struct{})
InformerFor(obj runtime.Object, newFunc NewInformerFunc) cache.SharedIndexInformer
}
3.4.2. sharedInformerFactory.Start
Start方法初始化各种类型的informer。
// Start initializes all requested informers.
func (f *sharedInformerFactory) Start(stopCh <-chan struct{}) {
f.lock.Lock()
defer f.lock.Unlock()
for informerType, informer := range f.informers {
if !f.startedInformers[informerType] {
go informer.Run(stopCh)
f.startedInformers[informerType] = true
}
}
}
3.4.3. sharedIndexInformer.Run
sharedIndexInformer.Run具体运行了sharedIndexInformer的实现逻辑,该部分待后续对informer机制做专题分析。
func (s *sharedIndexInformer) Run(stopCh <-chan struct{}) {
defer utilruntime.HandleCrash()
fifo := NewDeltaFIFO(MetaNamespaceKeyFunc, nil, s.indexer)
cfg := &Config{
Queue: fifo,
ListerWatcher: s.listerWatcher,
ObjectType: s.objectType,
FullResyncPeriod: s.resyncCheckPeriod,
RetryOnError: false,
ShouldResync: s.processor.shouldResync,
Process: s.HandleDeltas,
}
func() {
s.startedLock.Lock()
defer s.startedLock.Unlock()
s.controller = New(cfg)
s.controller.(*controller).clock = s.clock
s.started = true
}()
// Separate stop channel because Processor should be stopped strictly after controller
processorStopCh := make(chan struct{})
var wg wait.Group
defer wg.Wait() // Wait for Processor to stop
defer close(processorStopCh) // Tell Processor to stop
wg.StartWithChannel(processorStopCh, s.cacheMutationDetector.Run)
wg.StartWithChannel(processorStopCh, s.processor.run)
defer func() {
s.startedLock.Lock()
defer s.startedLock.Unlock()
s.stopped = true // Don't want any new listeners
}()
s.controller.Run(stopCh)
}
4. initFn(ctx)
initFn实际调用的就是各种类型的controller,代码位于kubernetes/cmd/kube-controller-manager/app/apps.go,本文以startStatefulSetController和startDeploymentController为例,controller中实际调用的函数逻辑位于kubernetes/pkg/controller中,待后续分析。
4.1. startStatefulSetController
func startStatefulSetController(ctx ControllerContext) (http.Handler, bool, error) {
if !ctx.AvailableResources[schema.GroupVersionResource{Group: "apps", Version: "v1", Resource: "statefulsets"}] {
return nil, false, nil
}
go statefulset.NewStatefulSetController(
ctx.InformerFactory.Core().V1().Pods(),
ctx.InformerFactory.Apps().V1().StatefulSets(),
ctx.InformerFactory.Core().V1().PersistentVolumeClaims(),
ctx.InformerFactory.Apps().V1().ControllerRevisions(),
ctx.ClientBuilder.ClientOrDie("statefulset-controller"),
).Run(1, ctx.Stop)
return nil, true, nil
}
其中使用到了InformerFactory,包含了Pods、StatefulSets、PersistentVolumeClaims、ControllerRevisions的informer。
startStatefulSetController主要调用的函数为NewStatefulSetController和对应的Run函数。
4.2. startDeploymentController
func startDeploymentController(ctx ControllerContext) (http.Handler, bool, error) {
if !ctx.AvailableResources[schema.GroupVersionResource{Group: "apps", Version: "v1", Resource: "deployments"}] {
return nil, false, nil
}
dc, err := deployment.NewDeploymentController(
ctx.InformerFactory.Apps().V1().Deployments(),
ctx.InformerFactory.Apps().V1().ReplicaSets(),
ctx.InformerFactory.Core().V1().Pods(),
ctx.ClientBuilder.ClientOrDie("deployment-controller"),
)
if err != nil {
return nil, true, fmt.Errorf("error creating Deployment controller: %v", err)
}
go dc.Run(int(ctx.ComponentConfig.DeploymentController.ConcurrentDeploymentSyncs), ctx.Stop)
return nil, true, nil
}
startDeploymentController主要调用的函数为NewDeploymentController和对应的Run函数。该部分逻辑在kubernetes/pkg/controller中。
5. 总结
- Kube-controller-manager的代码风格仍然是Cobra命令行框架:通过构造ControllerManagerCommand,然后执行command.Execute()函数。基本的流程就是构造option,添加Flags,执行Run函数。
- cmd部分的调用流程如下:Main-->NewControllerManagerCommand-->Run(c.Complete(), wait.NeverStop)-->StartControllers-->initFn(ctx)-->startDeploymentController/startStatefulSetController-->sts.NewStatefulSetController.Run/dc.NewDeploymentController.Run-->pkg/controller。
- 其中CreateControllerContext函数用来创建各类型controller所需要使用的context,NewControllerInitializers初始化了各种类型的controller,其中就包括DeploymentController和StatefulSetController等。
基本流程如下:
- 构造controller manager option,并转化为Config对象,执行Run函数。
- 基于Config对象创建ControllerContext,其中包含InformerFactory。
- 基于ControllerContext运行各种controller,各种controller的定义在NewControllerInitializers中。
- 执行InformerFactory.Start。
- 每种controller都会构造自身的结构体并执行对应的Run函数。
参考:
- https://github.com/kubernetes/kubernetes/tree/v1.12.0/cmd/kube-controller-manager
- https://github.com/kubernetes/kubernetes/blob/v1.12.0/cmd/kube-controller-manager/controller-manager.go
- https://github.com/kubernetes/kubernetes/blob/v1.12.0/cmd/kube-controller-manager/app/controllermanager.go
- https://github.com/kubernetes/kubernetes/blob/v1.12.0/cmd/kube-controller-manager/app/apps.go
11.3.3 - kube-controller-manager源码分析(三)之 Informer机制
以下代码分析基于kubernetes v1.12.0版本。
本文主要分析k8s中各个核心组件经常使用到的Informer机制(即List-Watch)。该部分的代码主要位于client-go这个第三方包中。
此部分的逻辑主要位于/vendor/k8s.io/client-go/tools/cache包中,代码目录结构如下:
cache
├── controller.go # 包含:Config、Run、processLoop、NewInformer、NewIndexerInformer
├── delta_fifo.go # 包含:NewDeltaFIFO、DeltaFIFO、AddIfNotPresent
├── expiration_cache.go
├── expiration_cache_fakes.go
├── fake_custom_store.go
├── fifo.go # 包含:Queue、FIFO、NewFIFO
├── heap.go
├── index.go # 包含:Indexer、MetaNamespaceIndexFunc
├── listers.go
├── listwatch.go # 包含:ListerWatcher、ListWatch、List、Watch
├── mutation_cache.go
├── mutation_detector.go
├── reflector.go # 包含:Reflector、NewReflector、Run、ListAndWatch
├── reflector_metrics.go
├── shared_informer.go # 包含:NewSharedInformer、WaitForCacheSync、Run、HasSynced
├── store.go # 包含:Store、MetaNamespaceKeyFunc、SplitMetaNamespaceKey
├── testing
│ ├── fake_controller_source.go
├── thread_safe_store.go # 包含:ThreadSafeStore、threadSafeMap
├── undelta_store.go
0. 原理示意图
示意图1、示意图2:Informer(List-Watch)机制的原理示意图(原图略)。
0.1. client-go组件
- Reflector:reflector用来watch特定的k8s API资源。具体的实现是通过ListAndWatch的方法,watch可以是k8s内建的资源或者是自定义的资源。当reflector通过watch API接收到有关新资源实例存在的通知时,它使用相应的列表API获取新创建的对象,并将其放入watchHandler函数内的Delta FIFO队列中。
- Informer:informer从Delta FIFO队列中弹出对象,执行此操作的函数是processLoop。base controller的作用是保存对象以供以后检索,并调用我们的控制器将对象传递给它。
- Indexer:索引器提供对象的索引功能。典型的索引用例是基于对象标签创建索引。Indexer可以根据多个索引函数维护索引。Indexer使用线程安全的数据存储来存储对象及其键。在Store中定义了一个名为MetaNamespaceKeyFunc的默认函数,该函数生成对象的键,形式为<namespace>/<name>的组合(见下面的示例)。
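下面给出一个说明key格式的极简示意(非k8s源码,Pod对象及其namespace、name均为假设的示例值),展示MetaNamespaceKeyFunc与SplitMetaNamespaceKey的对应关系:

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/cache"
)

func main() {
	// 示例Pod对象,namespace和name均为假设值
	pod := &v1.Pod{
		ObjectMeta: metav1.ObjectMeta{Namespace: "default", Name: "nginx"},
	}

	// MetaNamespaceKeyFunc生成<namespace>/<name>形式的key
	key, _ := cache.MetaNamespaceKeyFunc(pod)
	fmt.Println(key) // 输出:default/nginx

	// SplitMetaNamespaceKey执行相反的操作,常用于syncHandler(如syncDeployment)中
	namespace, name, _ := cache.SplitMetaNamespaceKey(key)
	fmt.Println(namespace, name) // 输出:default nginx
}
```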
0.2. 自定义controller组件
- Informer reference:指的是Informer实例的引用,定义如何使用自定义资源对象。自定义控制器代码需要创建对应的Informer。
- Indexer reference:自定义控制器对Indexer实例的引用。自定义控制器需要创建对应的Indexer。

client-go中提供NewIndexerInformer函数可以创建Informer和Indexer。

- Resource Event Handlers:资源事件回调函数,当Informer想要将对象传递给控制器时,它将被调用。编写这些函数的典型模式是获取对象的key,并将该key放入工作队列以进行进一步处理。
- Work queue:任务队列。资源事件处理函数提取传递对象的key,并将其添加到任务队列中。
- Process Item:处理任务队列中对象的函数,这些函数通常使用Indexer引用或Listing包装器来检索与该key对应的对象(这些组件的典型组合方式见下面的示例)。
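下面是一个把Informer、Resource Event Handlers、Work queue组合起来的极简示意(并非某个真实controller的实现;其中kubeconfig路径、30秒的resync周期等均为假设值):

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/util/workqueue"
)

func main() {
	// 假设使用本地kubeconfig构造clientset,路径仅为示例
	config, err := clientcmd.BuildConfigFromFlags("", "/root/.kube/config")
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(config)

	stopCh := make(chan struct{})
	defer close(stopCh)

	// 构造SharedInformerFactory,对应controller-manager中的InformerFactory
	factory := informers.NewSharedInformerFactory(clientset, 30*time.Second)
	podInformer := factory.Core().V1().Pods().Informer()

	// Work queue:事件回调只负责把key入队
	queue := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())

	// Resource Event Handlers:提取对象的key并放入工作队列
	podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			if key, err := cache.MetaNamespaceKeyFunc(obj); err == nil {
				queue.Add(key)
			}
		},
		UpdateFunc: func(old, new interface{}) {
			if key, err := cache.MetaNamespaceKeyFunc(new); err == nil {
				queue.Add(key)
			}
		},
		DeleteFunc: func(obj interface{}) {
			if key, err := cache.DeletionHandlingMetaNamespaceKeyFunc(obj); err == nil {
				queue.Add(key)
			}
		},
	})

	// 启动informer并等待缓存同步,对应InformerFactory.Start与WaitForCacheSync
	factory.Start(stopCh)
	if !cache.WaitForCacheSync(stopCh, podInformer.HasSynced) {
		panic("timed out waiting for caches to sync")
	}

	// Process Item:从队列中取出key并处理,对应processNextWorkItem的模式
	for {
		key, quit := queue.Get()
		if quit {
			return
		}
		namespace, name, _ := cache.SplitMetaNamespaceKey(key.(string))
		fmt.Printf("sync pod %s/%s\n", namespace, name)
		queue.Forget(key)
		queue.Done(key)
	}
}
```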
1. sharedInformerFactory.Start
在controller-manager的Run函数部分调用了InformerFactory.Start的方法。
此部分代码位于/cmd/kube-controller-manager/app/controllermanager.go
// Run runs the KubeControllerManagerOptions. This should never exit.
func Run(c *config.CompletedConfig, stopCh <-chan struct{}) error {
...
controllerContext.InformerFactory.Start(controllerContext.Stop)
close(controllerContext.InformersStarted)
...
}
InformerFactory是一个SharedInformerFactory的接口,接口定义如下:
此部分代码位于vendor/k8s.io/client-go/informers/internalinterfaces/factory_interfaces.go
// SharedInformerFactory a small interface to allow for adding an informer without an import cycle
type SharedInformerFactory interface {
Start(stopCh <-chan struct{})
InformerFor(obj runtime.Object, newFunc NewInformerFunc) cache.SharedIndexInformer
}
Start方法初始化各种类型的informer,并为每种类型的informer启动一个informer.Run的goroutine。
此部分代码位于vendor/k8s.io/client-go/informers/factory.go
// Start initializes all requested informers.
func (f *sharedInformerFactory) Start(stopCh <-chan struct{}) {
f.lock.Lock()
defer f.lock.Unlock()
for informerType, informer := range f.informers {
if !f.startedInformers[informerType] {
go informer.Run(stopCh)
f.startedInformers[informerType] = true
}
}
}
2. sharedIndexInformer.Run
此部分的代码位于/vendor/k8s.io/client-go/tools/cache/shared_informer.go
func (s *sharedIndexInformer) Run(stopCh <-chan struct{}) {
defer utilruntime.HandleCrash()
fifo := NewDeltaFIFO(MetaNamespaceKeyFunc, nil, s.indexer)
cfg := &Config{
Queue: fifo,
ListerWatcher: s.listerWatcher,
ObjectType: s.objectType,
FullResyncPeriod: s.resyncCheckPeriod,
RetryOnError: false,
ShouldResync: s.processor.shouldResync,
Process: s.HandleDeltas,
}
func() {
s.startedLock.Lock()
defer s.startedLock.Unlock()
s.controller = New(cfg)
s.controller.(*controller).clock = s.clock
s.started = true
}()
// Separate stop channel because Processor should be stopped strictly after controller
processorStopCh := make(chan struct{})
var wg wait.Group
defer wg.Wait() // Wait for Processor to stop
defer close(processorStopCh) // Tell Processor to stop
wg.StartWithChannel(processorStopCh, s.cacheMutationDetector.Run)
wg.StartWithChannel(processorStopCh, s.processor.run)
defer func() {
s.startedLock.Lock()
defer s.startedLock.Unlock()
s.stopped = true // Don't want any new listeners
}()
s.controller.Run(stopCh)
}
2.1. NewDeltaFIFO
DeltaFIFO是一个存储对象变化(Delta)的先进先出队列,process函数接收该队列Pop方法输出的对象并进行处理。
fifo := NewDeltaFIFO(MetaNamespaceKeyFunc, nil, s.indexer)
2.2. Config
构造controller的配置Config,其中的Process字段被赋值为HandleDeltas,即后面处理队列对象时使用的process函数。
cfg := &Config{
Queue: fifo,
ListerWatcher: s.listerWatcher,
ObjectType: s.objectType,
FullResyncPeriod: s.resyncCheckPeriod,
RetryOnError: false,
ShouldResync: s.processor.shouldResync,
Process: s.HandleDeltas,
}
2.3. controller
调用New(cfg),构建sharedIndexInformer的controller。
func() {
s.startedLock.Lock()
defer s.startedLock.Unlock()
s.controller = New(cfg)
s.controller.(*controller).clock = s.clock
s.started = true
}()
2.4. cacheMutationDetector.Run
调用s.cacheMutationDetector.Run,检查缓存对象是否变化。
wg.StartWithChannel(processorStopCh, s.cacheMutationDetector.Run)
defaultCacheMutationDetector.Run
func (d *defaultCacheMutationDetector) Run(stopCh <-chan struct{}) {
// we DON'T want protection from panics. If we're running this code, we want to die
for {
d.CompareObjects()
select {
case <-stopCh:
return
case <-time.After(d.period):
}
}
}
CompareObjects
func (d *defaultCacheMutationDetector) CompareObjects() {
d.lock.Lock()
defer d.lock.Unlock()
altered := false
for i, obj := range d.cachedObjs {
if !reflect.DeepEqual(obj.cached, obj.copied) {
fmt.Printf("CACHE %s[%d] ALTERED!\n%v\n", d.name, i, diff.ObjectDiff(obj.cached, obj.copied))
altered = true
}
}
if altered {
msg := fmt.Sprintf("cache %s modified", d.name)
if d.failureFunc != nil {
d.failureFunc(msg)
return
}
panic(msg)
}
}
2.5. processor.run
调用s.processor.run,即执行sharedProcessor.run,其中会调用listener.run和listener.pop,执行处理queue的函数。
wg.StartWithChannel(processorStopCh, s.processor.run)
sharedProcessor.Run
func (p *sharedProcessor) run(stopCh <-chan struct{}) {
func() {
p.listenersLock.RLock()
defer p.listenersLock.RUnlock()
for _, listener := range p.listeners {
p.wg.Start(listener.run)
p.wg.Start(listener.pop)
}
}()
<-stopCh
p.listenersLock.RLock()
defer p.listenersLock.RUnlock()
for _, listener := range p.listeners {
close(listener.addCh) // Tell .pop() to stop. .pop() will tell .run() to stop
}
p.wg.Wait() // Wait for all .pop() and .run() to stop
}
该部分逻辑待后面分析。
2.6. controller.Run
调用s.controller.Run,构建Reflector,将etcd中的资源数据缓存到本地store。
defer func() {
s.startedLock.Lock()
defer s.startedLock.Unlock()
s.stopped = true // Don't want any new listeners
}()
s.controller.Run(stopCh)
controller.Run
此部分代码位于/vendor/k8s.io/client-go/tools/cache/controller.go
// Run begins processing items, and will continue until a value is sent down stopCh.
// It's an error to call Run more than once.
// Run blocks; call via go.
func (c *controller) Run(stopCh <-chan struct{}) {
defer utilruntime.HandleCrash()
go func() {
<-stopCh
c.config.Queue.Close()
}()
r := NewReflector(
c.config.ListerWatcher,
c.config.ObjectType,
c.config.Queue,
c.config.FullResyncPeriod,
)
r.ShouldResync = c.config.ShouldResync
r.clock = c.clock
c.reflectorMutex.Lock()
c.reflector = r
c.reflectorMutex.Unlock()
var wg wait.Group
defer wg.Wait()
wg.StartWithChannel(stopCh, r.Run)
wait.Until(c.processLoop, time.Second, stopCh)
}
核心代码:
// 构建Reflector
r := NewReflector(
c.config.ListerWatcher,
c.config.ObjectType,
c.config.Queue,
c.config.FullResyncPeriod,
)
// 运行Reflector
wg.StartWithChannel(stopCh, r.Run)
// 执行processLoop
wait.Until(c.processLoop, time.Second, stopCh)
3. Reflector
3.1. Reflector
Reflector的主要作用是watch指定的k8s资源,并将变化同步到本地的store中。Reflector只会将指定的expectedType类型的资源放入store中,除非expectedType为nil。如果resyncPeriod不为零,那么Reflector会以resyncPeriod为周期定期执行list操作,这样就可以使用Reflector来定期处理所有的对象,也可以逐步处理变化的对象。(在下面的结构体定义之后给出一个单独使用Reflector的示例。)
常用属性说明:
- expectedType:期望放入缓存store的资源类型。
- store:watch的资源对应的本地缓存。
- listerWatcher:list和watch的接口。
- period:watch的周期,默认为1秒。
- resyncPeriod:resync的周期,当非零的时候,会按该周期执行list。
- lastSyncResourceVersion:最新一次看到的资源的版本号,主要在watch时候使用。
// Reflector watches a specified resource and causes all changes to be reflected in the given store.
type Reflector struct {
// name identifies this reflector. By default it will be a file:line if possible.
name string
// metrics tracks basic metric information about the reflector
metrics *reflectorMetrics
// The type of object we expect to place in the store.
expectedType reflect.Type
// The destination to sync up with the watch source
store Store
// listerWatcher is used to perform lists and watches.
listerWatcher ListerWatcher
// period controls timing between one watch ending and
// the beginning of the next one.
period time.Duration
resyncPeriod time.Duration
ShouldResync func() bool
// clock allows tests to manipulate time
clock clock.Clock
// lastSyncResourceVersion is the resource version token last
// observed when doing a sync with the underlying store
// it is thread safe, but not synchronized with the underlying store
lastSyncResourceVersion string
// lastSyncResourceVersionMutex guards read/write access to lastSyncResourceVersion
lastSyncResourceVersionMutex sync.RWMutex
}
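为了便于理解Reflector的职责,下面给出一个脱离informer、单独使用Reflector的极简示意(仅为示例;使用InClusterConfig构造clientset、监听default命名空间的Pod、30秒的resyncPeriod均为假设):

```go
package main

import (
	"fmt"
	"time"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/fields"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

func main() {
	// 假设运行在集群内,使用InClusterConfig构造clientset,仅为示例
	config, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(config)

	// listerWatcher:对Pod资源的list和watch接口
	lw := cache.NewListWatchFromClient(
		clientset.CoreV1().RESTClient(), "pods", "default", fields.Everything())

	// store:watch到的Pod对应的本地缓存,key为<namespace>/<name>
	store := cache.NewStore(cache.MetaNamespaceKeyFunc)

	// expectedType为&v1.Pod{},resyncPeriod为30秒(假设值)
	reflector := cache.NewReflector(lw, &v1.Pod{}, store, 30*time.Second)

	stopCh := make(chan struct{})
	go reflector.Run(stopCh) // 内部循环执行ListAndWatch

	// store中的数据会被Reflector持续同步
	time.Sleep(5 * time.Second)
	for _, key := range store.ListKeys() {
		fmt.Println(key)
	}
	close(stopCh)
}
```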
3.2. NewReflector
NewReflector主要用来构建Reflector的结构体。
此部分的代码位于/vendor/k8s.io/client-go/tools/cache/reflector.go
// NewReflector creates a new Reflector object which will keep the given store up to
// date with the server's contents for the given resource. Reflector promises to
// only put things in the store that have the type of expectedType, unless expectedType
// is nil. If resyncPeriod is non-zero, then lists will be executed after every
// resyncPeriod, so that you can use reflectors to periodically process everything as
// well as incrementally processing the things that change.
func NewReflector(lw ListerWatcher, expectedType interface{}, store Store, resyncPeriod time.Duration) *Reflector {
return NewNamedReflector(getDefaultReflectorName(internalPackages...), lw, expectedType, store, resyncPeriod)
}
// reflectorDisambiguator is used to disambiguate started reflectors.
// initialized to an unstable value to ensure meaning isn't attributed to the suffix.
var reflectorDisambiguator = int64(time.Now().UnixNano() % 12345)
// NewNamedReflector same as NewReflector, but with a specified name for logging
func NewNamedReflector(name string, lw ListerWatcher, expectedType interface{}, store Store, resyncPeriod time.Duration) *Reflector {
reflectorSuffix := atomic.AddInt64(&reflectorDisambiguator, 1)
r := &Reflector{
name: name,
// we need this to be unique per process (some names are still the same)but obvious who it belongs to
metrics: newReflectorMetrics(makeValidPromethusMetricLabel(fmt.Sprintf("reflector_"+name+"_%d", reflectorSuffix))),
listerWatcher: lw,
store: store,
expectedType: reflect.TypeOf(expectedType),
period: time.Second,
resyncPeriod: resyncPeriod,
clock: &clock.RealClock{},
}
return r
}
3.3. Reflector.Run
Reflector.Run主要执行了ListAndWatch的方法。
// Run starts a watch and handles watch events. Will restart the watch if it is closed.
// Run will exit when stopCh is closed.
func (r *Reflector) Run(stopCh <-chan struct{}) {
glog.V(3).Infof("Starting reflector %v (%s) from %s", r.expectedType, r.resyncPeriod, r.name)
wait.Until(func() {
if err := r.ListAndWatch(stopCh); err != nil {
utilruntime.HandleError(err)
}
}, r.period, stopCh)
}
3.4. ListAndWatch
ListAndWatch第一次会列出所有的对象,并获取资源对象的版本号,然后使用该版本号进行watch,以监听资源对象是否被变更。首次list时会将资源版本号设置为0,list()可能会导致本地的缓存相对于etcd里面的内容存在延迟,Reflector会通过watch的方法将延迟的部分补充上,使得本地的缓存数据与etcd的数据保持一致。
3.4.1. List
// ListAndWatch first lists all items and get the resource version at the moment of call,
// and then use the resource version to watch.
// It returns error if ListAndWatch didn't even try to initialize watch.
func (r *Reflector) ListAndWatch(stopCh <-chan struct{}) error {
glog.V(3).Infof("Listing and watching %v from %s", r.expectedType, r.name)
var resourceVersion string
// Explicitly set "0" as resource version - it's fine for the List()
// to be served from cache and potentially be delayed relative to
// etcd contents. Reflector framework will catch up via Watch() eventually.
options := metav1.ListOptions{ResourceVersion: "0"}
r.metrics.numberOfLists.Inc()
start := r.clock.Now()
list, err := r.listerWatcher.List(options)
if err != nil {
return fmt.Errorf("%s: Failed to list %v: %v", r.name, r.expectedType, err)
}
r.metrics.listDuration.Observe(time.Since(start).Seconds())
listMetaInterface, err := meta.ListAccessor(list)
if err != nil {
return fmt.Errorf("%s: Unable to understand list result %#v: %v", r.name, list, err)
}
resourceVersion = listMetaInterface.GetResourceVersion()
items, err := meta.ExtractList(list)
if err != nil {
return fmt.Errorf("%s: Unable to understand list result %#v (%v)", r.name, list, err)
}
r.metrics.numberOfItemsInList.Observe(float64(len(items)))
if err := r.syncWith(items, resourceVersion); err != nil {
return fmt.Errorf("%s: Unable to sync list result: %v", r.name, err)
}
r.setLastSyncResourceVersion(resourceVersion)
...
}
首先将资源的版本号设置为0,然后调用listerWatcher.List(options),列出所有list的内容。
// 版本号设置为0
options := metav1.ListOptions{ResourceVersion: "0"}
// list接口
list, err := r.listerWatcher.List(options)
获取资源版本号,并将list的内容提取成对象列表。
// 获取版本号
resourceVersion = listMetaInterface.GetResourceVersion()
// 将list的内容提取成对象列表
items, err := meta.ExtractList(list)
将list中对象列表的内容和版本号存储到本地的缓存store中,并全量替换已有的store的内容。
err := r.syncWith(items, resourceVersion)
syncWith调用了store的Replace的方法来替换原来store中的数据。
// syncWith replaces the store's items with the given list.
func (r *Reflector) syncWith(items []runtime.Object, resourceVersion string) error {
found := make([]interface{}, 0, len(items))
for _, item := range items {
found = append(found, item)
}
return r.store.Replace(found, resourceVersion)
}
Store.Replace方法定义如下:
type Store interface {
...
// Replace will delete the contents of the store, using instead the
// given list. Store takes ownership of the list, you should not reference
// it after calling this function.
Replace([]interface{}, string) error
...
}
最后设置最新的资源版本号。
r.setLastSyncResourceVersion(resourceVersion)
setLastSyncResourceVersion:
func (r *Reflector) setLastSyncResourceVersion(v string) {
r.lastSyncResourceVersionMutex.Lock()
defer r.lastSyncResourceVersionMutex.Unlock()
r.lastSyncResourceVersion = v
rv, err := strconv.Atoi(v)
if err == nil {
r.metrics.lastResourceVersion.Set(float64(rv))
}
}
3.4.2. store.Resync
resyncerrc := make(chan error, 1)
cancelCh := make(chan struct{})
defer close(cancelCh)
go func() {
resyncCh, cleanup := r.resyncChan()
defer func() {
cleanup() // Call the last one written into cleanup
}()
for {
select {
case <-resyncCh:
case <-stopCh:
return
case <-cancelCh:
return
}
if r.ShouldResync == nil || r.ShouldResync() {
glog.V(4).Infof("%s: forcing resync", r.name)
if err := r.store.Resync(); err != nil {
resyncerrc <- err
return
}
}
cleanup()
resyncCh, cleanup = r.resyncChan()
}
}()
核心代码:
err := r.store.Resync()
store的具体对象为DeltaFIFO,即调用DeltaFIFO.Resync
// Resync will send a sync event for each item
func (f *DeltaFIFO) Resync() error {
f.lock.Lock()
defer f.lock.Unlock()
if f.knownObjects == nil {
return nil
}
keys := f.knownObjects.ListKeys()
for _, k := range keys {
if err := f.syncKeyLocked(k); err != nil {
return err
}
}
return nil
}
3.4.3. Watch
for {
// give the stopCh a chance to stop the loop, even in case of continue statements further down on errors
select {
case <-stopCh:
return nil
default:
}
timemoutseconds := int64(minWatchTimeout.Seconds() * (rand.Float64() + 1.0))
options = metav1.ListOptions{
ResourceVersion: resourceVersion,
// We want to avoid situations of hanging watchers. Stop any wachers that do not
// receive any events within the timeout window.
TimeoutSeconds: &timemoutseconds,
}
r.metrics.numberOfWatches.Inc()
w, err := r.listerWatcher.Watch(options)
if err != nil {
switch err {
case io.EOF:
// watch closed normally
case io.ErrUnexpectedEOF:
glog.V(1).Infof("%s: Watch for %v closed with unexpected EOF: %v", r.name, r.expectedType, err)
default:
utilruntime.HandleError(fmt.Errorf("%s: Failed to watch %v: %v", r.name, r.expectedType, err))
}
// If this is "connection refused" error, it means that most likely apiserver is not responsive.
// It doesn't make sense to re-list all objects because most likely we will be able to restart
// watch where we ended.
// If that's the case wait and resend watch request.
if urlError, ok := err.(*url.Error); ok {
if opError, ok := urlError.Err.(*net.OpError); ok {
if errno, ok := opError.Err.(syscall.Errno); ok && errno == syscall.ECONNREFUSED {
time.Sleep(time.Second)
continue
}
}
}
return nil
}
if err := r.watchHandler(w, &resourceVersion, resyncerrc, stopCh); err != nil {
if err != errorStopRequested {
glog.Warningf("%s: watch of %v ended with: %v", r.name, r.expectedType, err)
}
return nil
}
}
设置watch的超时时间:以minWatchTimeout(5分钟)乘以1到2之间的随机系数,即超时时间在5到10分钟之间。
timemoutseconds := int64(minWatchTimeout.Seconds() * (rand.Float64() + 1.0))
options = metav1.ListOptions{
ResourceVersion: resourceVersion,
// We want to avoid situations of hanging watchers. Stop any wachers that do not
// receive any events within the timeout window.
TimeoutSeconds: &timemoutseconds,
}
执行listerWatcher.Watch(options)。
w, err := r.listerWatcher.Watch(options)
执行watchHandler。
err := r.watchHandler(w, &resourceVersion, resyncerrc, stopCh)
3.4.4. watchHandler
watchHandler主要是通过watch的方式保证当前的资源版本是最新的。
// watchHandler watches w and keeps *resourceVersion up to date.
func (r *Reflector) watchHandler(w watch.Interface, resourceVersion *string, errc chan error, stopCh <-chan struct{}) error {
start := r.clock.Now()
eventCount := 0
// Stopping the watcher should be idempotent and if we return from this function there's no way
// we're coming back in with the same watch interface.
defer w.Stop()
// update metrics
defer func() {
r.metrics.numberOfItemsInWatch.Observe(float64(eventCount))
r.metrics.watchDuration.Observe(time.Since(start).Seconds())
}()
loop:
for {
select {
case <-stopCh:
return errorStopRequested
case err := <-errc:
return err
case event, ok := <-w.ResultChan():
if !ok {
break loop
}
if event.Type == watch.Error {
return apierrs.FromObject(event.Object)
}
if e, a := r.expectedType, reflect.TypeOf(event.Object); e != nil && e != a {
utilruntime.HandleError(fmt.Errorf("%s: expected type %v, but watch event object had type %v", r.name, e, a))
continue
}
meta, err := meta.Accessor(event.Object)
if err != nil {
utilruntime.HandleError(fmt.Errorf("%s: unable to understand watch event %#v", r.name, event))
continue
}
newResourceVersion := meta.GetResourceVersion()
switch event.Type {
case watch.Added:
err := r.store.Add(event.Object)
if err != nil {
utilruntime.HandleError(fmt.Errorf("%s: unable to add watch event object (%#v) to store: %v", r.name, event.Object, err))
}
case watch.Modified:
err := r.store.Update(event.Object)
if err != nil {
utilruntime.HandleError(fmt.Errorf("%s: unable to update watch event object (%#v) to store: %v", r.name, event.Object, err))
}
case watch.Deleted:
// TODO: Will any consumers need access to the "last known
// state", which is passed in event.Object? If so, may need
// to change this.
err := r.store.Delete(event.Object)
if err != nil {
utilruntime.HandleError(fmt.Errorf("%s: unable to delete watch event object (%#v) from store: %v", r.name, event.Object, err))
}
default:
utilruntime.HandleError(fmt.Errorf("%s: unable to understand watch event %#v", r.name, event))
}
*resourceVersion = newResourceVersion
r.setLastSyncResourceVersion(newResourceVersion)
eventCount++
}
}
watchDuration := r.clock.Now().Sub(start)
if watchDuration < 1*time.Second && eventCount == 0 {
r.metrics.numberOfShortWatches.Inc()
return fmt.Errorf("very short watch: %s: Unexpected watch close - watch lasted less than a second and no items received", r.name)
}
glog.V(4).Infof("%s: Watch close - %v total %v items received", r.name, r.expectedType, eventCount)
return nil
}
通过watch接口的ResultChan()获取事件channel,并从中读取事件的内容。
for {
select {
...
case event, ok := <-w.ResultChan():
...
}
当获得添加、更新、删除的事件时,将对应的对象更新到本地缓存store中。
switch event.Type {
case watch.Added:
err := r.store.Add(event.Object)
if err != nil {
utilruntime.HandleError(fmt.Errorf("%s: unable to add watch event object (%#v) to store: %v", r.name, event.Object, err))
}
case watch.Modified:
err := r.store.Update(event.Object)
if err != nil {
utilruntime.HandleError(fmt.Errorf("%s: unable to update watch event object (%#v) to store: %v", r.name, event.Object, err))
}
case watch.Deleted:
// TODO: Will any consumers need access to the "last known
// state", which is passed in event.Object? If so, may need
// to change this.
err := r.store.Delete(event.Object)
if err != nil {
utilruntime.HandleError(fmt.Errorf("%s: unable to delete watch event object (%#v) from store: %v", r.name, event.Object, err))
}
default:
utilruntime.HandleError(fmt.Errorf("%s: unable to understand watch event %#v", r.name, event))
}
更新当前的最新版本号。
newResourceVersion := meta.GetResourceVersion()
*resourceVersion = newResourceVersion
r.setLastSyncResourceVersion(newResourceVersion)
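下面给出一个最小的示意程序(并非上游实现),利用k8s.io/apimachinery/pkg/watch包中的FakeWatcher模拟事件流,演示与watchHandler类似的消费过程:从ResultChan读取事件,按事件类型更新本地缓存,并推进resourceVersion。其中的对象名与版本号均为示例假设。

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/watch"
)

func main() {
	// 本地"缓存",以对象名为key,模拟Reflector写入store的效果
	store := map[string]*corev1.Pod{}

	// FakeWatcher实现了watch.Interface,便于在没有apiserver的情况下演示事件流
	fw := watch.NewFake()
	go func() {
		defer fw.Stop()
		fw.Add(&corev1.Pod{ObjectMeta: metav1.ObjectMeta{Name: "demo", ResourceVersion: "1"}})
		fw.Modify(&corev1.Pod{ObjectMeta: metav1.ObjectMeta{Name: "demo", ResourceVersion: "2"}})
		fw.Delete(&corev1.Pod{ObjectMeta: metav1.ObjectMeta{Name: "demo", ResourceVersion: "3"}})
	}()

	resourceVersion := "0"
	// 与watchHandler类似:从ResultChan取事件,按事件类型更新本地缓存并推进resourceVersion
	for event := range fw.ResultChan() {
		pod, ok := event.Object.(*corev1.Pod)
		if !ok {
			continue
		}
		switch event.Type {
		case watch.Added, watch.Modified:
			store[pod.Name] = pod
		case watch.Deleted:
			delete(store, pod.Name)
		}
		resourceVersion = pod.ResourceVersion
		fmt.Printf("event=%s pod=%s rv=%s cached=%d\n", event.Type, pod.Name, resourceVersion, len(store))
	}
}
```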
通过对Reflector模块的分析,可以看到多次使用到本地缓存store模块,而store的数据由DeltaFIFO赋值而来,以下针对DeltaFIFO和store做分析。
4. DeltaFIFO
DeltaFIFO由NewDeltaFIFO初始化,并赋值给config.Queue。
func (s *sharedIndexInformer) Run(stopCh <-chan struct{}) {
fifo := NewDeltaFIFO(MetaNamespaceKeyFunc, nil, s.indexer)
cfg := &Config{
Queue: fifo,
...
}
...
}
4.1. NewDeltaFIFO
// NewDeltaFIFO returns a Store which can be used process changes to items.
//
// keyFunc is used to figure out what key an object should have. (It's
// exposed in the returned DeltaFIFO's KeyOf() method, with bonus features.)
//
// 'compressor' may compress as many or as few items as it wants
// (including returning an empty slice), but it should do what it
// does quickly since it is called while the queue is locked.
// 'compressor' may be nil if you don't want any delta compression.
//
// 'keyLister' is expected to return a list of keys that the consumer of
// this queue "knows about". It is used to decide which items are missing
// when Replace() is called; 'Deleted' deltas are produced for these items.
// It may be nil if you don't need to detect all deletions.
// TODO: consider merging keyLister with this object, tracking a list of
// "known" keys when Pop() is called. Have to think about how that
// affects error retrying.
// TODO(lavalamp): I believe there is a possible race only when using an
// external known object source that the above TODO would
// fix.
//
// Also see the comment on DeltaFIFO.
func NewDeltaFIFO(keyFunc KeyFunc, compressor DeltaCompressor, knownObjects KeyListerGetter) *DeltaFIFO {
f := &DeltaFIFO{
items: map[string]Deltas{},
queue: []string{},
keyFunc: keyFunc,
deltaCompressor: compressor,
knownObjects: knownObjects,
}
f.cond.L = &f.lock
return f
}
controller.Run的部分调用了NewReflector。
func (c *controller) Run(stopCh <-chan struct{}) {
...
r := NewReflector(
c.config.ListerWatcher,
c.config.ObjectType,
c.config.Queue,
c.config.FullResyncPeriod,
)
...
}
NewReflector构造函数,将c.config.Queue赋值给Reflector.store的属性。
func NewReflector(lw ListerWatcher, expectedType interface{}, store Store, resyncPeriod time.Duration) *Reflector {
return NewNamedReflector(getDefaultReflectorName(internalPackages...), lw, expectedType, store, resyncPeriod)
}
// NewNamedReflector same as NewReflector, but with a specified name for logging
func NewNamedReflector(name string, lw ListerWatcher, expectedType interface{}, store Store, resyncPeriod time.Duration) *Reflector {
reflectorSuffix := atomic.AddInt64(&reflectorDisambiguator, 1)
r := &Reflector{
name: name,
// we need this to be unique per process (some names are still the same)but obvious who it belongs to
metrics: newReflectorMetrics(makeValidPromethusMetricLabel(fmt.Sprintf("reflector_"+name+"_%d", reflectorSuffix))),
listerWatcher: lw,
store: store,
expectedType: reflect.TypeOf(expectedType),
period: time.Second,
resyncPeriod: resyncPeriod,
clock: &clock.RealClock{},
}
return r
}
4.2. DeltaFIFO
DeltaFIFO是一个生产者与消费者的队列,其中Reflector是生产者,消费者调用Pop()的方法。
DeltaFIFO主要用在以下场景:
- 希望对象变更最多处理一次
- 处理对象时,希望查看自上次处理对象以来发生的所有事情
- 要处理对象的删除
- 希望定期重新处理对象
// DeltaFIFO is like FIFO, but allows you to process deletes.
//
// DeltaFIFO is a producer-consumer queue, where a Reflector is
// intended to be the producer, and the consumer is whatever calls
// the Pop() method.
//
// DeltaFIFO solves this use case:
// * You want to process every object change (delta) at most once.
// * When you process an object, you want to see everything
// that's happened to it since you last processed it.
// * You want to process the deletion of objects.
// * You might want to periodically reprocess objects.
//
// DeltaFIFO's Pop(), Get(), and GetByKey() methods return
// interface{} to satisfy the Store/Queue interfaces, but it
// will always return an object of type Deltas.
//
// A note on threading: If you call Pop() in parallel from multiple
// threads, you could end up with multiple threads processing slightly
// different versions of the same object.
//
// A note on the KeyLister used by the DeltaFIFO: It's main purpose is
// to list keys that are "known", for the purpose of figuring out which
// items have been deleted when Replace() or Delete() are called. The deleted
// object will be included in the DeleteFinalStateUnknown markers. These objects
// could be stale.
//
// You may provide a function to compress deltas (e.g., represent a
// series of Updates as a single Update).
type DeltaFIFO struct {
// lock/cond protects access to 'items' and 'queue'.
lock sync.RWMutex
cond sync.Cond
// We depend on the property that items in the set are in
// the queue and vice versa, and that all Deltas in this
// map have at least one Delta.
items map[string]Deltas
queue []string
// populated is true if the first batch of items inserted by Replace() has been populated
// or Delete/Add/Update was called first.
populated bool
// initialPopulationCount is the number of items inserted by the first call of Replace()
initialPopulationCount int
// keyFunc is used to make the key used for queued item
// insertion and retrieval, and should be deterministic.
keyFunc KeyFunc
// deltaCompressor tells us how to combine two or more
// deltas. It may be nil.
deltaCompressor DeltaCompressor
// knownObjects list keys that are "known", for the
// purpose of figuring out which items have been deleted
// when Replace() or Delete() is called.
knownObjects KeyListerGetter
// Indication the queue is closed.
// Used to indicate a queue is closed so a control loop can exit when a queue is empty.
// Currently, not used to gate any of CRED operations.
closed bool
closedLock sync.Mutex
}
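为了更直观地理解items与queue的配合方式,下面给出一个极简的示意实现(并非client-go的真实代码,省略了Replace、knownObjects等逻辑):同一个key的多个增量会被合并到一条Deltas中,消费者一次Pop可以拿到该对象排队期间积累的全部增量。

```go
package main

import (
	"fmt"
	"sync"
)

// delta/deltaQueue是DeltaFIFO的极简示意:每个key对应一串增量,
// 与上文f.items(map[string]Deltas)和f.queue([]string)的结构对应。
type deltaType string

const (
	added   deltaType = "Added"
	updated deltaType = "Updated"
)

type delta struct {
	Type   deltaType
	Object interface{}
}

type deltaQueue struct {
	lock  sync.Mutex
	cond  *sync.Cond
	items map[string][]delta
	queue []string
}

func newDeltaQueue() *deltaQueue {
	q := &deltaQueue{items: map[string][]delta{}}
	q.cond = sync.NewCond(&q.lock)
	return q
}

func (q *deltaQueue) add(key string, d delta) {
	q.lock.Lock()
	defer q.lock.Unlock()
	if _, ok := q.items[key]; !ok {
		q.queue = append(q.queue, key) // 只有新key才进入队列,同一对象的增量被合并处理
	}
	q.items[key] = append(q.items[key], d)
	q.cond.Broadcast()
}

func (q *deltaQueue) pop(process func(key string, ds []delta)) {
	q.lock.Lock()
	defer q.lock.Unlock()
	for len(q.queue) == 0 {
		q.cond.Wait() // 队列为空时阻塞,与DeltaFIFO.Pop一致
	}
	key := q.queue[0]
	q.queue = q.queue[1:]
	ds := q.items[key]
	delete(q.items, key)
	process(key, ds) // process在持锁状态下执行,与上游实现一致
}

func main() {
	q := newDeltaQueue()
	// 生产者:模拟Reflector写入同一对象的多个增量
	q.add("default/nginx", delta{Type: added, Object: "v1"})
	q.add("default/nginx", delta{Type: updated, Object: "v2"})
	// 消费者:一次Pop拿到该对象排队期间累计的全部增量
	q.pop(func(key string, ds []delta) {
		for _, d := range ds {
			fmt.Printf("key=%s type=%s obj=%v\n", key, d.Type, d.Object)
		}
	})
}
```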
4.3. Queue & Store
DeltaFIFO实现了Queue接口,而Reflector.store对应Store接口。Queue接口在Store的基础上增加了Pop方法,Process函数用来处理Queue.Pop出来的数据对象。
// Queue is exactly like a Store, but has a Pop() method too.
type Queue interface {
Store
// Pop blocks until it has something to process.
// It returns the object that was process and the result of processing.
// The PopProcessFunc may return an ErrRequeue{...} to indicate the item
// should be requeued before releasing the lock on the queue.
Pop(PopProcessFunc) (interface{}, error)
// AddIfNotPresent adds a value previously
// returned by Pop back into the queue as long
// as nothing else (presumably more recent)
// has since been added.
AddIfNotPresent(interface{}) error
// Return true if the first batch of items has been popped
HasSynced() bool
// Close queue
Close()
}
5. store
Store是一个通用的存储接口,Reflector通过watch server的方式更新数据到store中,store给Reflector提供本地的缓存,让Reflector可以像消息队列一样的工作。
Store实现的是一种可以准确的写入对象和获取对象的机制。
// Store is a generic object storage interface. Reflector knows how to watch a server
// and update a store. A generic store is provided, which allows Reflector to be used
// as a local caching system, and an LRU store, which allows Reflector to work like a
// queue of items yet to be processed.
//
// Store makes no assumptions about stored object identity; it is the responsibility
// of a Store implementation to provide a mechanism to correctly key objects and to
// define the contract for obtaining objects by some arbitrary key type.
type Store interface {
Add(obj interface{}) error
Update(obj interface{}) error
Delete(obj interface{}) error
List() []interface{}
ListKeys() []string
Get(obj interface{}) (item interface{}, exists bool, err error)
GetByKey(key string) (item interface{}, exists bool, err error)
// Replace will delete the contents of the store, using instead the
// given list. Store takes ownership of the list, you should not reference
// it after calling this function.
Replace([]interface{}, string) error
Resync() error
}
其中Replace方法会删除原来store中的内容,并将新增的list的内容存入store中,即完全替换数据。
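下面给出一个使用client-go中cache.NewStore的简单示例(对象与key均为示例假设),演示Store接口的Add、GetByKey与Replace的基本用法。

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/cache"
)

func main() {
	// 使用MetaNamespaceKeyFunc(即"namespace/name")作为key,构造一个通用Store
	store := cache.NewStore(cache.MetaNamespaceKeyFunc)

	pod := &corev1.Pod{ObjectMeta: metav1.ObjectMeta{Namespace: "default", Name: "nginx"}}
	if err := store.Add(pod); err != nil {
		panic(err)
	}

	// 按key读取对象
	obj, exists, err := store.GetByKey("default/nginx")
	fmt.Println(obj != nil, exists, err)

	// Replace会用给定的列表整体替换store中的内容
	if err := store.Replace([]interface{}{pod}, "1"); err != nil {
		panic(err)
	}
	fmt.Println(len(store.List()))
}
```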
5.1. cache
cache实现了store的接口,而cache的具体实现又是调用ThreadSafeStore接口来实现功能的。
cache的功能主要有以下两点:
- 通过keyFunc计算对象的key
- 调用ThreadSafeStorage接口的方法
// cache responsibilities are limited to:
// 1. Computing keys for objects via keyFunc
// 2. Invoking methods of a ThreadSafeStorage interface
type cache struct {
// cacheStorage bears the burden of thread safety for the cache
cacheStorage ThreadSafeStore
// keyFunc is used to make the key for objects stored in and retrieved from items, and
// should be deterministic.
keyFunc KeyFunc
}
其中ListAndWatch主要用到以下的方法:
cache.Replace
// Replace will delete the contents of 'c', using instead the given list.
// 'c' takes ownership of the list, you should not reference the list again
// after calling this function.
func (c *cache) Replace(list []interface{}, resourceVersion string) error {
items := map[string]interface{}{}
for _, item := range list {
key, err := c.keyFunc(item)
if err != nil {
return KeyError{item, err}
}
items[key] = item
}
c.cacheStorage.Replace(items, resourceVersion)
return nil
}
cache.Add
// Add inserts an item into the cache.
func (c *cache) Add(obj interface{}) error {
key, err := c.keyFunc(obj)
if err != nil {
return KeyError{obj, err}
}
c.cacheStorage.Add(key, obj)
return nil
}
cache.Update
// Update sets an item in the cache to its updated state.
func (c *cache) Update(obj interface{}) error {
key, err := c.keyFunc(obj)
if err != nil {
return KeyError{obj, err}
}
c.cacheStorage.Update(key, obj)
return nil
}
cache.Delete
// Delete removes an item from the cache.
func (c *cache) Delete(obj interface{}) error {
key, err := c.keyFunc(obj)
if err != nil {
return KeyError{obj, err}
}
c.cacheStorage.Delete(key)
return nil
}
5.2. ThreadSafeStore
cache的具体是调用ThreadSafeStore来实现的。
// ThreadSafeStore is an interface that allows concurrent access to a storage backend.
// TL;DR caveats: you must not modify anything returned by Get or List as it will break
// the indexing feature in addition to not being thread safe.
//
// The guarantees of thread safety provided by List/Get are only valid if the caller
// treats returned items as read-only. For example, a pointer inserted in the store
// through `Add` will be returned as is by `Get`. Multiple clients might invoke `Get`
// on the same key and modify the pointer in a non-thread-safe way. Also note that
// modifying objects stored by the indexers (if any) will *not* automatically lead
// to a re-index. So it's not a good idea to directly modify the objects returned by
// Get/List, in general.
type ThreadSafeStore interface {
Add(key string, obj interface{})
Update(key string, obj interface{})
Delete(key string)
Get(key string) (item interface{}, exists bool)
List() []interface{}
ListKeys() []string
Replace(map[string]interface{}, string)
Index(indexName string, obj interface{}) ([]interface{}, error)
IndexKeys(indexName, indexKey string) ([]string, error)
ListIndexFuncValues(name string) []string
ByIndex(indexName, indexKey string) ([]interface{}, error)
GetIndexers() Indexers
// AddIndexers adds more indexers to this store. If you call this after you already have data
// in the store, the results are undefined.
AddIndexers(newIndexers Indexers) error
Resync() error
}
threadSafeMap
// threadSafeMap implements ThreadSafeStore
type threadSafeMap struct {
lock sync.RWMutex
items map[string]interface{}
// indexers maps a name to an IndexFunc
indexers Indexers
// indices maps a name to an Index
indices Indices
}
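下面给出一个基于cache.NewIndexer的使用示例(索引名byNamespace为示例假设),演示threadSafeMap提供的索引能力:按IndexFunc建立索引,再通过ByIndex按索引值查询缓存中的对象。

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/cache"
)

func main() {
	// 定义一个按namespace建立索引的IndexFunc,底层由threadSafeMap的indexers/indices支撑
	indexer := cache.NewIndexer(cache.MetaNamespaceKeyFunc, cache.Indexers{
		"byNamespace": func(obj interface{}) ([]string, error) {
			pod, ok := obj.(*corev1.Pod)
			if !ok {
				return nil, fmt.Errorf("unexpected type %T", obj)
			}
			return []string{pod.Namespace}, nil
		},
	})

	_ = indexer.Add(&corev1.Pod{ObjectMeta: metav1.ObjectMeta{Namespace: "default", Name: "a"}})
	_ = indexer.Add(&corev1.Pod{ObjectMeta: metav1.ObjectMeta{Namespace: "kube-system", Name: "b"}})

	// ByIndex按索引名和索引值查询,这里列出default命名空间下缓存的对象
	objs, err := indexer.ByIndex("byNamespace", "default")
	fmt.Println(len(objs), err)
}
```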
6. processLoop
func (c *controller) Run(stopCh <-chan struct{}) {
...
wait.Until(c.processLoop, time.Second, stopCh)
}
在controller.Run方法中会调用processLoop,以下分析processLoop的处理逻辑。
// processLoop drains the work queue.
// TODO: Consider doing the processing in parallel. This will require a little thought
// to make sure that we don't end up processing the same object multiple times
// concurrently.
//
// TODO: Plumb through the stopCh here (and down to the queue) so that this can
// actually exit when the controller is stopped. Or just give up on this stuff
// ever being stoppable. Converting this whole package to use Context would
// also be helpful.
func (c *controller) processLoop() {
for {
obj, err := c.config.Queue.Pop(PopProcessFunc(c.config.Process))
if err != nil {
if err == FIFOClosedError {
return
}
if c.config.RetryOnError {
// This is the safe way to re-enqueue.
c.config.Queue.AddIfNotPresent(obj)
}
}
}
}
processLoop主要处理任务队列中的任务,其中处理逻辑是调用具体的ProcessFunc函数来实现,核心代码为:
obj, err := c.config.Queue.Pop(PopProcessFunc(c.config.Process))
6.1. DeltaFIFO.Pop
Pop会阻塞住直到队列里面添加了新的对象,如果有多个对象,按照先进先出的原则处理,如果某个对象没有处理成功会重新被加入该队列中。
Pop中会调用具体的process函数来处理对象。
// Pop blocks until an item is added to the queue, and then returns it. If
// multiple items are ready, they are returned in the order in which they were
// added/updated. The item is removed from the queue (and the store) before it
// is returned, so if you don't successfully process it, you need to add it back
// with AddIfNotPresent().
// process function is called under lock, so it is safe update data structures
// in it that need to be in sync with the queue (e.g. knownKeys). The PopProcessFunc
// may return an instance of ErrRequeue with a nested error to indicate the current
// item should be requeued (equivalent to calling AddIfNotPresent under the lock).
//
// Pop returns a 'Deltas', which has a complete list of all the things
// that happened to the object (deltas) while it was sitting in the queue.
func (f *DeltaFIFO) Pop(process PopProcessFunc) (interface{}, error) {
f.lock.Lock()
defer f.lock.Unlock()
for {
for len(f.queue) == 0 {
// When the queue is empty, invocation of Pop() is blocked until new item is enqueued.
// When Close() is called, the f.closed is set and the condition is broadcasted.
// Which causes this loop to continue and return from the Pop().
if f.IsClosed() {
return nil, FIFOClosedError
}
f.cond.Wait()
}
id := f.queue[0]
f.queue = f.queue[1:]
item, ok := f.items[id]
if f.initialPopulationCount > 0 {
f.initialPopulationCount--
}
if !ok {
// Item may have been deleted subsequently.
continue
}
delete(f.items, id)
err := process(item)
if e, ok := err.(ErrRequeue); ok {
f.addIfNotPresent(id, item)
err = e.Err
}
// Don't need to copyDeltas here, because we're transferring
// ownership to the caller.
return item, err
}
}
核心代码:
for {
...
item, ok := f.items[id]
...
err := process(item)
if e, ok := err.(ErrRequeue); ok {
f.addIfNotPresent(id, item)
err = e.Err
}
// Don't need to copyDeltas here, because we're transferring
// ownership to the caller.
return item, err
}
6.2. HandleDeltas
cfg := &Config{
Queue: fifo,
ListerWatcher: s.listerWatcher,
ObjectType: s.objectType,
FullResyncPeriod: s.resyncCheckPeriod,
RetryOnError: false,
ShouldResync: s.processor.shouldResync,
Process: s.HandleDeltas,
}
其中process函数就是在sharedIndexInformer.Run方法中,给config.Process赋值的HandleDeltas函数。
func (s *sharedIndexInformer) HandleDeltas(obj interface{}) error {
s.blockDeltas.Lock()
defer s.blockDeltas.Unlock()
// from oldest to newest
for _, d := range obj.(Deltas) {
switch d.Type {
case Sync, Added, Updated:
isSync := d.Type == Sync
s.cacheMutationDetector.AddObject(d.Object)
if old, exists, err := s.indexer.Get(d.Object); err == nil && exists {
if err := s.indexer.Update(d.Object); err != nil {
return err
}
s.processor.distribute(updateNotification{oldObj: old, newObj: d.Object}, isSync)
} else {
if err := s.indexer.Add(d.Object); err != nil {
return err
}
s.processor.distribute(addNotification{newObj: d.Object}, isSync)
}
case Deleted:
if err := s.indexer.Delete(d.Object); err != nil {
return err
}
s.processor.distribute(deleteNotification{oldObj: d.Object}, false)
}
}
return nil
}
核心代码:
switch d.Type {
case Sync, Added, Updated:
...
if old, exists, err := s.indexer.Get(d.Object); err == nil && exists {
...
s.processor.distribute(updateNotification{oldObj: old, newObj: d.Object}, isSync)
} else {
...
s.processor.distribute(addNotification{newObj: d.Object}, isSync)
}
case Deleted:
...
s.processor.distribute(deleteNotification{oldObj: d.Object}, false)
}
根据不同的类型,调用processor.distribute方法,该方法将对象加入processorListener的channel中。
6.3. sharedProcessor.distribute
func (p *sharedProcessor) distribute(obj interface{}, sync bool) {
p.listenersLock.RLock()
defer p.listenersLock.RUnlock()
if sync {
for _, listener := range p.syncingListeners {
listener.add(obj)
}
} else {
for _, listener := range p.listeners {
listener.add(obj)
}
}
}
processorListener.add:
func (p *processorListener) add(notification interface{}) {
p.addCh <- notification
}
综合以上的分析可以看出,processLoop通过调用HandleDeltas,HandleDeltas再调用distribute,distribute最终通过processorListener.add将不同更新类型的对象加入processorListener的channel中,供processorListener.run使用。以下分析processorListener.run的部分。
7. processor
processor的主要功能就是记录了所有的回调函数实例(即 ResourceEventHandler 实例),并负责触发这些函数。在sharedIndexInformer.Run部分会调用processor.run。
流程:
- listener的add函数负责将notify装进pendingNotifications。
- pop函数取出pendingNotifications的第一个notify,输出到nextCh channel。
- run函数则负责取出notify,然后根据notify的类型(增加、删除、更新)触发相应的处理函数,这些函数是在不同的NewXxxController实现中注册的。
func (s *sharedIndexInformer) Run(stopCh <-chan struct{}) {
...
wg.StartWithChannel(processorStopCh, s.processor.run)
...
}
7.1. sharedProcessor.Run
func (p *sharedProcessor) run(stopCh <-chan struct{}) {
func() {
p.listenersLock.RLock()
defer p.listenersLock.RUnlock()
for _, listener := range p.listeners {
p.wg.Start(listener.run)
p.wg.Start(listener.pop)
}
}()
<-stopCh
p.listenersLock.RLock()
defer p.listenersLock.RUnlock()
for _, listener := range p.listeners {
close(listener.addCh) // Tell .pop() to stop. .pop() will tell .run() to stop
}
p.wg.Wait() // Wait for all .pop() and .run() to stop
}
7.1.1. listener.pop
pop函数取出pendingNotifications的第一个nofify,输出到nextCh channel。
func (p *processorListener) pop() {
defer utilruntime.HandleCrash()
defer close(p.nextCh) // Tell .run() to stop
var nextCh chan<- interface{}
var notification interface{}
for {
select {
case nextCh <- notification:
// Notification dispatched
var ok bool
notification, ok = p.pendingNotifications.ReadOne()
if !ok { // Nothing to pop
nextCh = nil // Disable this select case
}
case notificationToAdd, ok := <-p.addCh:
if !ok {
return
}
if notification == nil { // No notification to pop (and pendingNotifications is empty)
// Optimize the case - skip adding to pendingNotifications
notification = notificationToAdd
nextCh = p.nextCh
} else { // There is already a notification waiting to be dispatched
p.pendingNotifications.WriteOne(notificationToAdd)
}
}
}
}
7.1.2. listener.run
listener.run部分根据不同的更新类型调用不同的处理函数。
func (p *processorListener) run() {
defer utilruntime.HandleCrash()
for next := range p.nextCh {
switch notification := next.(type) {
case updateNotification:
p.handler.OnUpdate(notification.oldObj, notification.newObj)
case addNotification:
p.handler.OnAdd(notification.newObj)
case deleteNotification:
p.handler.OnDelete(notification.oldObj)
default:
utilruntime.HandleError(fmt.Errorf("unrecognized notification: %#v", next))
}
}
}
其中具体的实现函数handler是在NewDeploymentController(其他不同类型的controller类似)中赋值的,而该handler是一个接口,具体如下:
// ResourceEventHandler can handle notifications for events that happen to a
// resource. The events are informational only, so you can't return an
// error.
// * OnAdd is called when an object is added.
// * OnUpdate is called when an object is modified. Note that oldObj is the
// last known state of the object-- it is possible that several changes
// were combined together, so you can't use this to see every single
// change. OnUpdate is also called when a re-list happens, and it will
// get called even if nothing changed. This is useful for periodically
// evaluating or syncing something.
// * OnDelete will get the final state of the item if it is known, otherwise
// it will get an object of type DeletedFinalStateUnknown. This can
// happen if the watch is closed and misses the delete event and we don't
// notice the deletion until the subsequent re-list.
type ResourceEventHandler interface {
OnAdd(obj interface{})
OnUpdate(oldObj, newObj interface{})
OnDelete(obj interface{})
}
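下面给出一个注册ResourceEventHandler的使用示例(kubeconfig路径、resync周期等均为示例假设,并非DeploymentController的实现),演示回调函数注册与触发的入口:通过SharedInformerFactory取得informer后调用AddEventHandler,这些回调最终由processorListener.run触发。

```go
package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// 假设使用默认路径下的kubeconfig构造clientset,仅用于演示
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(config)

	// 构造SharedInformerFactory,resync周期设为30秒
	factory := informers.NewSharedInformerFactory(clientset, 30*time.Second)
	podInformer := factory.Core().V1().Pods().Informer()

	// 注册ResourceEventHandler:增、删、改事件分别触发对应回调
	podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			fmt.Println("add:", obj.(*corev1.Pod).Name)
		},
		UpdateFunc: func(oldObj, newObj interface{}) {
			fmt.Println("update:", newObj.(*corev1.Pod).Name)
		},
		DeleteFunc: func(obj interface{}) {
			if pod, ok := obj.(*corev1.Pod); ok {
				fmt.Println("delete:", pod.Name)
			}
		},
	})

	stopCh := make(chan struct{})
	defer close(stopCh)
	factory.Start(stopCh)

	// 简单阻塞一段时间以观察回调输出,实际程序中通常阻塞到进程退出
	time.Sleep(time.Minute)
}
```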
7.2. ResourceEventHandler
以下以DeploymentController的处理逻辑为例。
在NewDeploymentController部分会注册deployment的事件函数,以下注册了三种类型的事件函数,其中包括:dInformer、rsInformer和podInformer。
// NewDeploymentController creates a new DeploymentController.
func NewDeploymentController(dInformer extensionsinformers.DeploymentInformer, rsInformer extensionsinformers.ReplicaSetInformer, podInformer coreinformers.PodInformer, client clientset.Interface) (*DeploymentController, error) {
...
dInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
AddFunc: dc.addDeployment,
UpdateFunc: dc.updateDeployment,
// This will enter the sync loop and no-op, because the deployment has been deleted from the store.
DeleteFunc: dc.deleteDeployment,
})
rsInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
AddFunc: dc.addReplicaSet,
UpdateFunc: dc.updateReplicaSet,
DeleteFunc: dc.deleteReplicaSet,
})
podInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
DeleteFunc: dc.deletePod,
})
...
}
7.2.1. addDeployment
以下以addDeployment为例,addDeployment主要是将对象加入到enqueueDeployment的队列中。
func (dc *DeploymentController) addDeployment(obj interface{}) {
d := obj.(*extensions.Deployment)
glog.V(4).Infof("Adding deployment %s", d.Name)
dc.enqueueDeployment(d)
}
enqueueDeployment的定义
type DeploymentController struct {
...
enqueueDeployment func(deployment *extensions.Deployment)
...
}
将dc.enqueue赋值给dc.enqueueDeployment
dc.enqueueDeployment = dc.enqueue
dc.enqueue调用了dc.queue.Add(key)
func (dc *DeploymentController) enqueue(deployment *extensions.Deployment) {
key, err := controller.KeyFunc(deployment)
if err != nil {
utilruntime.HandleError(fmt.Errorf("Couldn't get key for object %#v: %v", deployment, err))
return
}
dc.queue.Add(key)
}
dc.queue主要记录了需要被同步的deployment的对象,供syncDeployment使用。
dc := &DeploymentController{
...
queue: workqueue.NewNamedRateLimitingQueue(workqueue.DefaultControllerRateLimiter(), "deployment"),
}
NewNamedRateLimitingQueue
func NewNamedRateLimitingQueue(rateLimiter RateLimiter, name string) RateLimitingInterface {
return &rateLimitingType{
DelayingInterface: NewNamedDelayingQueue(name),
rateLimiter: rateLimiter,
}
}
通过以上分析,可以看出processor记录了不同类型的事件函数,其中事件函数在NewXxxController构造函数部分注册。具体事件函数的处理,一般是将需要处理的对象加入对应controller的任务队列中,然后由类似syncDeployment的同步函数来维持期望状态的同步逻辑。
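下面给出一个使用同类限速队列的消费示例(key与sync函数均为示例假设),演示事件回调把key入队后,同步协程按Get、Done、Forget、AddRateLimited的典型套路消费任务。

```go
package main

import (
	"fmt"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	// 与DeploymentController中dc.queue相同类型的限速队列
	queue := workqueue.NewNamedRateLimitingQueue(workqueue.DefaultControllerRateLimiter(), "demo")

	// 事件回调一般只是把对象的key入队
	queue.Add("default/nginx")

	// 同步协程的典型消费逻辑:Get取出key,处理成功后Forget,失败则AddRateLimited重试
	key, shutdown := queue.Get()
	if shutdown {
		return
	}
	defer queue.Done(key)

	if err := sync(key.(string)); err != nil {
		queue.AddRateLimited(key)
		return
	}
	queue.Forget(key)
}

// sync模拟类似syncDeployment的同步函数
func sync(key string) error {
	fmt.Println("syncing", key)
	return nil
}
```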
8. 总结
本文分析的部分主要是k8s的informer机制,即List-Watch机制。
8.1. Reflector
Reflector的主要作用是watch指定的k8s资源,并将变化同步到本地的store中。Reflector只会放置指定的expectedType类型的资源到store中,除非expectedType为nil。如果resyncPeriod不为零,那么Reflector会以resyncPeriod为周期定期执行list操作,这样就可以使用Reflector来定期处理所有的对象,也可以逐步处理变化的对象。
8.2. ListAndWatch
ListAndWatch第一次会列出所有的对象,并获取资源对象的版本号,然后watch资源对象的版本号来查看是否有被变更。首先会将资源版本号设置为0,list()可能会导致本地的缓存相对于etcd里面的内容存在延迟,Reflector会通过watch的方法将延迟的部分补充上,使得本地的缓存数据与etcd的数据保持一致。
8.3. DeltaFIFO
DeltaFIFO是一个生产者与消费者的队列,其中Reflector是生产者,消费者调用Pop()的方法。
DeltaFIFO主要用在以下场景:
- 希望对象变更最多处理一次
- 处理对象时,希望查看自上次处理对象以来发生的所有事情
- 要处理对象的删除
- 希望定期重新处理对象
8.4. store
Store是一个通用的存储接口,Reflector通过watch server的方式更新数据到store中,store给Reflector提供本地的缓存,让Reflector可以像消息队列一样的工作。
Store实现的是一种可以准确的写入对象和获取对象的机制。
8.5. processor
processor的主要功能就是记录了所有的回调函数实例(即 ResourceEventHandler 实例),并负责触发这些函数。在sharedIndexInformer.Run部分会调用processor.run。
流程:
- listener的add函数负责将notify装进pendingNotifications。
- pop函数取出pendingNotifications的第一个notify,输出到nextCh channel。
- run函数则负责取出notify,然后根据notify的类型(增加、删除、更新)触发相应的处理函数,这些函数是在不同的NewXxxController实现中注册的。
processor记录了不同类型的事件函数,其中事件函数在NewXxxController构造函数部分注册。具体事件函数的处理,一般是将需要处理的对象加入对应controller的任务队列中,然后由类似syncDeployment的同步函数来维持期望状态的同步逻辑。
8.6. 主要步骤
- 在controller-manager的Run函数部分调用了InformerFactory.Start的方法,Start方法初始化各种类型的informer,并且每个类型起了个informer.Run的goroutine。
- informer.Run的部分先生成一个DeltaFIFO的队列来存储对象变化的数据。然后调用processor.Run和controller.Run函数。
- controller.Run函数会生成一个Reflector,Reflector的主要作用是watch指定的k8s资源,并将变化同步到本地的store中。Reflector以resyncPeriod为周期定期执行list操作,这样就可以使用Reflector来定期处理所有的对象,也可以逐步处理变化的对象。
- Reflector接着执行ListAndWatch函数,ListAndWatch第一次会列出所有的对象,并获取资源对象的版本号,然后watch资源对象的版本号来查看是否有被变更。首先会将资源版本号设置为0,list()可能会导致本地的缓存相对于etcd里面的内容存在延迟,Reflector会通过watch的方法将延迟的部分补充上,使得本地的缓存数据与etcd的数据保持一致。
- controller.Run函数还会调用processLoop函数,processLoop通过调用HandleDeltas,再调用distribute、processorListener.add,最终将不同更新类型的对象加入processorListener的channel中,供processorListener.run使用。
- processor的主要功能就是记录了所有的回调函数实例(即ResourceEventHandler实例),并负责触发这些函数。processor记录了不同类型的事件函数,其中事件函数在NewXxxController构造函数部分注册,具体事件函数的处理,一般是将需要处理的对象加入对应controller的任务队列中,然后由类似syncDeployment的同步函数来维持期望状态的同步逻辑。
11.4 -
11.4.1 - kube-scheduler源码分析(四)之 findNodesThatFit
以下代码分析基于kubernetes v1.12.0版本。
本文主要分析调度逻辑中的预选策略,即第一步筛选出符合pod调度条件的节点。
1. 调用入口
预选,通过预选函数来判断每个节点是否适合被该Pod调度。
genericScheduler.Schedule中对findNodesThatFit的调用过程如下:
此部分代码位于pkg/scheduler/core/generic_scheduler.go
func (g *genericScheduler) Schedule(pod *v1.Pod, nodeLister algorithm.NodeLister) (string, error) {
...
// 列出所有的节点
nodes, err := nodeLister.List()
if err != nil {
return "", err
}
if len(nodes) == 0 {
return "", ErrNoNodesAvailable
}
// Used for all fit and priority funcs.
err = g.cache.UpdateNodeNameToInfoMap(g.cachedNodeInfoMap)
if err != nil {
return "", err
}
trace.Step("Computing predicates")
startPredicateEvalTime := time.Now()
// 调用findNodesThatFit过滤出预选节点
filteredNodes, failedPredicateMap, err := g.findNodesThatFit(pod, nodes)
if err != nil {
return "", err
}
if len(filteredNodes) == 0 {
return "", &FitError{
Pod: pod,
NumAllNodes: len(nodes),
FailedPredicates: failedPredicateMap,
}
}
// metrics
metrics.SchedulingAlgorithmPredicateEvaluationDuration.Observe(metrics.SinceInMicroseconds(startPredicateEvalTime))
metrics.SchedulingLatency.WithLabelValues(metrics.PredicateEvaluation).Observe(metrics.SinceInSeconds(startPredicateEvalTime))
...
}
核心代码:
// 调用findNodesThatFit过滤出预选节点
filteredNodes, failedPredicateMap, err := g.findNodesThatFit(pod, nodes)
2. findNodesThatFit
findNodesThatFit基于给定的预选函数过滤node,每个node传入到预选函数中来确定该节点是否符合要求。
findNodesThatFit的入参是被调度的pod和当前的节点列表,返回预选节点列表和错误。
findNodesThatFit基本流程如下:
- 设置可行节点的总数,作为预选节点数组的容量,避免总节点过多导致需要筛选的节点过多、效率低。
- 通过NodeTree不断获取下一个节点来判断该节点是否满足pod的调度条件。
- 通过之前注册的各种预选函数来判断当前节点是否符合pod的调度条件。
- 最后返回满足调度条件的node列表,供下一步的优选操作。
findNodesThatFit完整代码如下:
此部分代码位于pkg/scheduler/core/generic_scheduler.go
// Filters the nodes to find the ones that fit based on the given predicate functions
// Each node is passed through the predicate functions to determine if it is a fit
func (g *genericScheduler) findNodesThatFit(pod *v1.Pod, nodes []*v1.Node) ([]*v1.Node, FailedPredicateMap, error) {
var filtered []*v1.Node
failedPredicateMap := FailedPredicateMap{}
if len(g.predicates) == 0 {
filtered = nodes
} else {
allNodes := int32(g.cache.NodeTree().NumNodes)
numNodesToFind := g.numFeasibleNodesToFind(allNodes)
// Create filtered list with enough space to avoid growing it
// and allow assigning.
filtered = make([]*v1.Node, numNodesToFind)
errs := errors.MessageCountMap{}
var (
predicateResultLock sync.Mutex
filteredLen int32
equivClass *equivalence.Class
)
ctx, cancel := context.WithCancel(context.Background())
// We can use the same metadata producer for all nodes.
meta := g.predicateMetaProducer(pod, g.cachedNodeInfoMap)
if g.equivalenceCache != nil {
// getEquivalenceClassInfo will return immediately if no equivalence pod found
equivClass = equivalence.NewClass(pod)
}
checkNode := func(i int) {
var nodeCache *equivalence.NodeCache
nodeName := g.cache.NodeTree().Next()
if g.equivalenceCache != nil {
nodeCache, _ = g.equivalenceCache.GetNodeCache(nodeName)
}
fits, failedPredicates, err := podFitsOnNode(
pod,
meta,
g.cachedNodeInfoMap[nodeName],
g.predicates,
g.cache,
nodeCache,
g.schedulingQueue,
g.alwaysCheckAllPredicates,
equivClass,
)
if err != nil {
predicateResultLock.Lock()
errs[err.Error()]++
predicateResultLock.Unlock()
return
}
if fits {
length := atomic.AddInt32(&filteredLen, 1)
if length > numNodesToFind {
cancel()
atomic.AddInt32(&filteredLen, -1)
} else {
filtered[length-1] = g.cachedNodeInfoMap[nodeName].Node()
}
} else {
predicateResultLock.Lock()
failedPredicateMap[nodeName] = failedPredicates
predicateResultLock.Unlock()
}
}
// Stops searching for more nodes once the configured number of feasible nodes
// are found.
workqueue.ParallelizeUntil(ctx, 16, int(allNodes), checkNode)
filtered = filtered[:filteredLen]
if len(errs) > 0 {
return []*v1.Node{}, FailedPredicateMap{}, errors.CreateAggregateFromMessageCountMap(errs)
}
}
if len(filtered) > 0 && len(g.extenders) != 0 {
for _, extender := range g.extenders {
if !extender.IsInterested(pod) {
continue
}
filteredList, failedMap, err := extender.Filter(pod, filtered, g.cachedNodeInfoMap)
if err != nil {
if extender.IsIgnorable() {
glog.Warningf("Skipping extender %v as it returned error %v and has ignorable flag set",
extender, err)
continue
} else {
return []*v1.Node{}, FailedPredicateMap{}, err
}
}
for failedNodeName, failedMsg := range failedMap {
if _, found := failedPredicateMap[failedNodeName]; !found {
failedPredicateMap[failedNodeName] = []algorithm.PredicateFailureReason{}
}
failedPredicateMap[failedNodeName] = append(failedPredicateMap[failedNodeName], predicates.NewFailureReason(failedMsg))
}
filtered = filteredList
if len(filtered) == 0 {
break
}
}
}
return filtered, failedPredicateMap, nil
}
以下对findNodesThatFit分段分析。
3. numFeasibleNodesToFind
findNodesThatFit先基于所有的节点计算出可行节点的总数。numFeasibleNodesToFind的作用主要是避免当节点过多(超过100)时影响调度的效率。
allNodes := int32(g.cache.NodeTree().NumNodes)
numNodesToFind := g.numFeasibleNodesToFind(allNodes)
// Create filtered list with enough space to avoid growing it
// and allow assigning.
filtered = make([]*v1.Node, numNodesToFind)
numFeasibleNodesToFind基本流程如下:
- 如果所有的node节点数小于minFeasibleNodesToFind(当前默认为100),则返回全部节点数。
- 如果节点数超过100,则取percentageOfNodesToScore指定百分比的节点数;当按百分比计算出的数目仍小于minFeasibleNodesToFind时,返回minFeasibleNodesToFind。
- 如果按百分比计算出的数目大于minFeasibleNodesToFind,则返回该数目。
// numFeasibleNodesToFind returns the number of feasible nodes that once found, the scheduler stops
// its search for more feasible nodes.
func (g *genericScheduler) numFeasibleNodesToFind(numAllNodes int32) int32 {
if numAllNodes < minFeasibleNodesToFind || g.percentageOfNodesToScore <= 0 ||
g.percentageOfNodesToScore >= 100 {
return numAllNodes
}
numNodes := numAllNodes * g.percentageOfNodesToScore / 100
if numNodes < minFeasibleNodesToFind {
return minFeasibleNodesToFind
}
return numNodes
}
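为了便于代入具体数字理解,下面给出一个与上述逻辑等价的示意函数(并非调度器源码本身,minFeasibleNodesToFind取默认值100),并演示几组输入对应的结果。

```go
package main

import "fmt"

const minFeasibleNodesToFind = 100

// numFeasibleNodesToFind的等价示意实现,便于代入数字理解
func numFeasibleNodesToFind(numAllNodes, percentageOfNodesToScore int32) int32 {
	if numAllNodes < minFeasibleNodesToFind || percentageOfNodesToScore <= 0 ||
		percentageOfNodesToScore >= 100 {
		return numAllNodes
	}
	numNodes := numAllNodes * percentageOfNodesToScore / 100
	if numNodes < minFeasibleNodesToFind {
		return minFeasibleNodesToFind
	}
	return numNodes
}

func main() {
	fmt.Println(numFeasibleNodesToFind(80, 50))   // 80:节点数不足100,全部参与预选
	fmt.Println(numFeasibleNodesToFind(150, 50))  // 100:按50%只有75,小于下限100
	fmt.Println(numFeasibleNodesToFind(5000, 50)) // 2500:按50%取2500个节点
}
```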
4. checkNode
checkNode是一个校验node是否符合要求的函数,其中实际调用到的核心函数是podFitsOnNode。再通过workqueue并发执行checkNode操作。
checkNode主要流程如下:
- 通过cache中的nodeTree不断获取下一个node。
- 将当前node和pod传入podFitsOnNode判断当前node是否符合要求。
- 如果当前node符合要求,就将当前node加入预选节点的数组filtered中。
- 如果当前node不满足要求,则加入到失败的数组中,并记录原因。
- 通过workqueue.ParallelizeUntil并发执行checkNode函数,一旦找到配置的可行节点数,就停止搜索更多节点。
checkNode := func(i int) {
var nodeCache *equivalence.NodeCache
nodeName := g.cache.NodeTree().Next()
if g.equivalenceCache != nil {
nodeCache, _ = g.equivalenceCache.GetNodeCache(nodeName)
}
fits, failedPredicates, err := podFitsOnNode(
pod,
meta,
g.cachedNodeInfoMap[nodeName],
g.predicates,
g.cache,
nodeCache,
g.schedulingQueue,
g.alwaysCheckAllPredicates,
equivClass,
)
if err != nil {
predicateResultLock.Lock()
errs[err.Error()]++
predicateResultLock.Unlock()
return
}
if fits {
length := atomic.AddInt32(&filteredLen, 1)
if length > numNodesToFind {
cancel()
atomic.AddInt32(&filteredLen, -1)
} else {
filtered[length-1] = g.cachedNodeInfoMap[nodeName].Node()
}
} else {
predicateResultLock.Lock()
failedPredicateMap[nodeName] = failedPredicates
predicateResultLock.Unlock()
}
}
workqueue的并发操作:
// Stops searching for more nodes once the configured number of feasible nodes
// are found.
workqueue.ParallelizeUntil(ctx, 16, int(allNodes), checkNode)
ParallelizeUntil具体代码如下:
// ParallelizeUntil is a framework that allows for parallelizing N
// independent pieces of work until done or the context is canceled.
func ParallelizeUntil(ctx context.Context, workers, pieces int, doWorkPiece DoWorkPieceFunc) {
var stop <-chan struct{}
if ctx != nil {
stop = ctx.Done()
}
toProcess := make(chan int, pieces)
for i := 0; i < pieces; i++ {
toProcess <- i
}
close(toProcess)
if pieces < workers {
workers = pieces
}
wg := sync.WaitGroup{}
wg.Add(workers)
for i := 0; i < workers; i++ {
go func() {
defer utilruntime.HandleCrash()
defer wg.Done()
for piece := range toProcess {
select {
case <-stop:
return
default:
doWorkPiece(piece)
}
}
}()
}
wg.Wait()
}
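下面给出一个调用ParallelizeUntil的使用示例(导入路径按v1.12时的k8s.io/client-go/util/workqueue,checkNode中的判断逻辑仅为占位示意),演示16个worker并发处理节点检查、可随context取消的用法。

```go
package main

import (
	"context"
	"fmt"
	"sync/atomic"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	nodes := []string{"node-1", "node-2", "node-3", "node-4"}
	var fitCount int32

	checkNode := func(i int) {
		// 这里用节点名非空这一判断代替真实的预选函数,仅演示并发骨架
		if nodes[i] != "" {
			atomic.AddInt32(&fitCount, 1)
		}
	}

	// 16个worker并发处理len(nodes)个任务,ctx取消时提前退出
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	workqueue.ParallelizeUntil(ctx, 16, len(nodes), checkNode)

	fmt.Println("fit nodes:", fitCount)
}
```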
5. podFitsOnNode
podFitsOnNode主要内容如下:
- podFitsOnNode会检查给定的某个Node是否满足预选函数。
- 对于给定的pod,podFitsOnNode会检查是否有相同的pod存在,尽量复用缓存过的预选结果。
podFitsOnNode主要在Schedule(调度)和Preempt(抢占)的时候被调用。
当在Schedule中被调用的时候,主要判断是否可以被调度到当前节点,依据为当前节点上所有已存在的pod及被提名要运行到该节点的具有相等或更高优先级的pod。
当在Preempt中被调用的时候,即发生抢占的时候,通过SelectVictimsOnNode函数选出需要被移除的pod,移除后再将预调度的pod调度到该节点上。
podFitsOnNode基本流程如下:
- 遍历之前注册好的预选策略predicates.Ordering,并获取预选策略的执行函数。
- 遍历执行每个预选函数,并返回是否合适、预选失败的原因和错误。
- 如果预选函数执行的结果不合适,则加入预选失败的数组中。
- 最后返回预选失败的个数是否为0,和预选失败的原因。
入参:
- pod
- PredicateMetadata
- NodeInfo
- predicateFuncs
- schedulercache.Cache
- nodeCache
- SchedulingQueue
- alwaysCheckAllPredicates
- equivClass
出参:
- fit
- PredicateFailureReason
完整代码如下:
此部分代码位于pkg/scheduler/core/generic_scheduler.go
// podFitsOnNode checks whether a node given by NodeInfo satisfies the given predicate functions.
// For given pod, podFitsOnNode will check if any equivalent pod exists and try to reuse its cached
// predicate results as possible.
// This function is called from two different places: Schedule and Preempt.
// When it is called from Schedule, we want to test whether the pod is schedulable
// on the node with all the existing pods on the node plus higher and equal priority
// pods nominated to run on the node.
// When it is called from Preempt, we should remove the victims of preemption and
// add the nominated pods. Removal of the victims is done by SelectVictimsOnNode().
// It removes victims from meta and NodeInfo before calling this function.
func podFitsOnNode(
pod *v1.Pod,
meta algorithm.PredicateMetadata,
info *schedulercache.NodeInfo,
predicateFuncs map[string]algorithm.FitPredicate,
cache schedulercache.Cache,
nodeCache *equivalence.NodeCache,
queue SchedulingQueue,
alwaysCheckAllPredicates bool,
equivClass *equivalence.Class,
) (bool, []algorithm.PredicateFailureReason, error) {
var (
eCacheAvailable bool
failedPredicates []algorithm.PredicateFailureReason
)
podsAdded := false
// We run predicates twice in some cases. If the node has greater or equal priority
// nominated pods, we run them when those pods are added to meta and nodeInfo.
// If all predicates succeed in this pass, we run them again when these
// nominated pods are not added. This second pass is necessary because some
// predicates such as inter-pod affinity may not pass without the nominated pods.
// If there are no nominated pods for the node or if the first run of the
// predicates fail, we don't run the second pass.
// We consider only equal or higher priority pods in the first pass, because
// those are the current "pod" must yield to them and not take a space opened
// for running them. It is ok if the current "pod" take resources freed for
// lower priority pods.
// Requiring that the new pod is schedulable in both circumstances ensures that
// we are making a conservative decision: predicates like resources and inter-pod
// anti-affinity are more likely to fail when the nominated pods are treated
// as running, while predicates like pod affinity are more likely to fail when
// the nominated pods are treated as not running. We can't just assume the
// nominated pods are running because they are not running right now and in fact,
// they may end up getting scheduled to a different node.
for i := 0; i < 2; i++ {
metaToUse := meta
nodeInfoToUse := info
if i == 0 {
podsAdded, metaToUse, nodeInfoToUse = addNominatedPods(util.GetPodPriority(pod), meta, info, queue)
} else if !podsAdded || len(failedPredicates) != 0 {
break
}
// Bypass eCache if node has any nominated pods.
// TODO(bsalamat): consider using eCache and adding proper eCache invalidations
// when pods are nominated or their nominations change.
eCacheAvailable = equivClass != nil && nodeCache != nil && !podsAdded
for _, predicateKey := range predicates.Ordering() {
var (
fit bool
reasons []algorithm.PredicateFailureReason
err error
)
//TODO (yastij) : compute average predicate restrictiveness to export it as Prometheus metric
if predicate, exist := predicateFuncs[predicateKey]; exist {
if eCacheAvailable {
fit, reasons, err = nodeCache.RunPredicate(predicate, predicateKey, pod, metaToUse, nodeInfoToUse, equivClass, cache)
} else {
fit, reasons, err = predicate(pod, metaToUse, nodeInfoToUse)
}
if err != nil {
return false, []algorithm.PredicateFailureReason{}, err
}
if !fit {
// eCache is available and valid, and predicates result is unfit, record the fail reasons
failedPredicates = append(failedPredicates, reasons...)
// if alwaysCheckAllPredicates is false, short circuit all predicates when one predicate fails.
if !alwaysCheckAllPredicates {
glog.V(5).Infoln("since alwaysCheckAllPredicates has not been set, the predicate " +
"evaluation is short circuited and there are chances " +
"of other predicates failing as well.")
break
}
}
}
}
}
return len(failedPredicates) == 0, failedPredicates, nil
}
5.1. predicateFuncs
根据之前初注册好的预选策略函数来执行预选,判断节点是否符合调度。
for _, predicateKey := range predicates.Ordering() {
if predicate, exist := predicateFuncs[predicateKey]; exist {
if eCacheAvailable {
fit, reasons, err = nodeCache.RunPredicate(predicate, predicateKey, pod, metaToUse, nodeInfoToUse, equivClass, cache)
} else {
fit, reasons, err = predicate(pod, metaToUse, nodeInfoToUse)
}
预选策略如下:
var (
predicatesOrdering = []string{CheckNodeConditionPred, CheckNodeUnschedulablePred,
GeneralPred, HostNamePred, PodFitsHostPortsPred,
MatchNodeSelectorPred, PodFitsResourcesPred, NoDiskConflictPred,
PodToleratesNodeTaintsPred, PodToleratesNodeNoExecuteTaintsPred, CheckNodeLabelPresencePred,
CheckServiceAffinityPred, MaxEBSVolumeCountPred, MaxGCEPDVolumeCountPred, MaxCSIVolumeCountPred,
MaxAzureDiskVolumeCountPred, CheckVolumeBindingPred, NoVolumeZoneConflictPred,
CheckNodeMemoryPressurePred, CheckNodePIDPressurePred, CheckNodeDiskPressurePred, MatchInterPodAffinityPred}
)
6. PodFitsResources
以下以PodFitsResources这个预选函数为例做分析,其他重要的预选函数待后续单独分析。
PodFitsResources用来检查一个节点是否有足够的资源来运行当前的pod,包括CPU、内存、GPU等。
PodFitsResources基本流程如下:
- 判断当前节点上pod总数加上预调度pod个数是否大于node的可分配pod总数,若是则不允许调度。
- 判断pod的request值是否都为0,若是则允许调度。
- 判断pod的request值加上当前node上所有pod的request值总和是否大于node的可分配资源,若是则不允许调度。
- 判断pod的拓展资源request值加上当前node上所有pod对应的request值总和是否大于node对应的可分配资源,若是则不允许调度。
PodFitsResources的注册代码如下:
factory.RegisterFitPredicate(predicates.PodFitsResourcesPred, predicates.PodFitsResources)
PodFitsResources入参:
- pod
- nodeInfo
- PredicateMetadata
PodFitsResources出参:
- fit
- PredicateFailureReason
PodFitsResources完整代码:
此部分的代码位于pkg/scheduler/algorithm/predicates/predicates.go
// PodFitsResources checks if a node has sufficient resources, such as cpu, memory, gpu, opaque int resources etc to run a pod.
// First return value indicates whether a node has sufficient resources to run a pod while the second return value indicates the
// predicate failure reasons if the node has insufficient resources to run the pod.
func PodFitsResources(pod *v1.Pod, meta algorithm.PredicateMetadata, nodeInfo *schedulercache.NodeInfo) (bool, []algorithm.PredicateFailureReason, error) {
node := nodeInfo.Node()
if node == nil {
return false, nil, fmt.Errorf("node not found")
}
var predicateFails []algorithm.PredicateFailureReason
allowedPodNumber := nodeInfo.AllowedPodNumber()
if len(nodeInfo.Pods())+1 > allowedPodNumber {
predicateFails = append(predicateFails, NewInsufficientResourceError(v1.ResourcePods, 1, int64(len(nodeInfo.Pods())), int64(allowedPodNumber)))
}
// No extended resources should be ignored by default.
ignoredExtendedResources := sets.NewString()
var podRequest *schedulercache.Resource
if predicateMeta, ok := meta.(*predicateMetadata); ok {
podRequest = predicateMeta.podRequest
if predicateMeta.ignoredExtendedResources != nil {
ignoredExtendedResources = predicateMeta.ignoredExtendedResources
}
} else {
// We couldn't parse metadata - fallback to computing it.
podRequest = GetResourceRequest(pod)
}
if podRequest.MilliCPU == 0 &&
podRequest.Memory == 0 &&
podRequest.EphemeralStorage == 0 &&
len(podRequest.ScalarResources) == 0 {
return len(predicateFails) == 0, predicateFails, nil
}
allocatable := nodeInfo.AllocatableResource()
if allocatable.MilliCPU < podRequest.MilliCPU+nodeInfo.RequestedResource().MilliCPU {
predicateFails = append(predicateFails, NewInsufficientResourceError(v1.ResourceCPU, podRequest.MilliCPU, nodeInfo.RequestedResource().MilliCPU, allocatable.MilliCPU))
}
if allocatable.Memory < podRequest.Memory+nodeInfo.RequestedResource().Memory {
predicateFails = append(predicateFails, NewInsufficientResourceError(v1.ResourceMemory, podRequest.Memory, nodeInfo.RequestedResource().Memory, allocatable.Memory))
}
if allocatable.EphemeralStorage < podRequest.EphemeralStorage+nodeInfo.RequestedResource().EphemeralStorage {
predicateFails = append(predicateFails, NewInsufficientResourceError(v1.ResourceEphemeralStorage, podRequest.EphemeralStorage, nodeInfo.RequestedResource().EphemeralStorage, allocatable.EphemeralStorage))
}
for rName, rQuant := range podRequest.ScalarResources {
if v1helper.IsExtendedResourceName(rName) {
// If this resource is one of the extended resources that should be
// ignored, we will skip checking it.
if ignoredExtendedResources.Has(string(rName)) {
continue
}
}
if allocatable.ScalarResources[rName] < rQuant+nodeInfo.RequestedResource().ScalarResources[rName] {
predicateFails = append(predicateFails, NewInsufficientResourceError(rName, podRequest.ScalarResources[rName], nodeInfo.RequestedResource().ScalarResources[rName], allocatable.ScalarResources[rName]))
}
}
if glog.V(10) {
if len(predicateFails) == 0 {
// We explicitly don't do glog.V(10).Infof() to avoid computing all the parameters if this is
// not logged. There is visible performance gain from it.
glog.Infof("Schedule Pod %+v on Node %+v is allowed, Node is running only %v out of %v Pods.",
podName(pod), node.Name, len(nodeInfo.Pods()), allowedPodNumber)
}
}
return len(predicateFails) == 0, predicateFails, nil
}
6.1. NodeInfo
NodeInfo是node的聚合信息,主要包括:
- node:k8s node的结构体
- pods:当前node上pod的数量
- requestedResource:当前node上所有pod的request总和
- allocatableResource:node的实际所有的可分配资源(对应于Node.Status.Allocatable.*),可理解为node的资源总量。
此部分代码位于pkg/scheduler/cache/node_info.go
// NodeInfo is node level aggregated information.
type NodeInfo struct {
// Overall node information.
node *v1.Node
pods []*v1.Pod
podsWithAffinity []*v1.Pod
usedPorts util.HostPortInfo
// Total requested resource of all pods on this node.
// It includes assumed pods which scheduler sends binding to apiserver but
// didn't get it as scheduled yet.
requestedResource *Resource
nonzeroRequest *Resource
// We store allocatedResources (which is Node.Status.Allocatable.*) explicitly
// as int64, to avoid conversions and accessing map.
allocatableResource *Resource
// Cached taints of the node for faster lookup.
taints []v1.Taint
taintsErr error
// imageStates holds the entry of an image if and only if this image is on the node. The entry can be used for
// checking an image's existence and advanced usage (e.g., image locality scheduling policy) based on the image
// state information.
imageStates map[string]*ImageStateSummary
// TransientInfo holds the information pertaining to a scheduling cycle. This will be destructed at the end of
// scheduling cycle.
// TODO: @ravig. Remove this once we have a clear approach for message passing across predicates and priorities.
TransientInfo *transientSchedulerInfo
// Cached conditions of node for faster lookup.
memoryPressureCondition v1.ConditionStatus
diskPressureCondition v1.ConditionStatus
pidPressureCondition v1.ConditionStatus
// Whenever NodeInfo changes, generation is bumped.
// This is used to avoid cloning it if the object didn't change.
generation int64
}
6.2. Resource
Resource是可计算资源的集合体。主要包括:
- MilliCPU
- Memory
- EphemeralStorage
- AllowedPodNumber:允许的pod总数(对应于Node.Status.Allocatable.Pods().Value()),一般为110。
- ScalarResources
// Resource is a collection of compute resource.
type Resource struct {
MilliCPU int64
Memory int64
EphemeralStorage int64
// We store allowedPodNumber (which is Node.Status.Allocatable.Pods().Value())
// explicitly as int, to avoid conversions and improve performance.
AllowedPodNumber int
// ScalarResources
ScalarResources map[v1.ResourceName]int64
}
以下分析PodFitsResources的具体流程。
6.3. allowedPodNumber
首先获取节点的信息,判断该节点当前所有pod的个数加上当前预调度的pod是否会大于该节点允许的pod总数(一般为110个)。如果超过,则predicateFails数组增加一项,即当前节点不适合该pod。
node := nodeInfo.Node()
if node == nil {
return false, nil, fmt.Errorf("node not found")
}
var predicateFails []algorithm.PredicateFailureReason
allowedPodNumber := nodeInfo.AllowedPodNumber()
if len(nodeInfo.Pods())+1 > allowedPodNumber {
predicateFails = append(predicateFails, NewInsufficientResourceError(v1.ResourcePods, 1, int64(len(nodeInfo.Pods())), int64(allowedPodNumber)))
}
6.4. podRequest
如果podRequest都为0,则允许调度到该节点,直接返回结果。
if podRequest.MilliCPU == 0 &&
podRequest.Memory == 0 &&
podRequest.EphemeralStorage == 0 &&
len(podRequest.ScalarResources) == 0 {
return len(predicateFails) == 0, predicateFails, nil
}
6.5. AllocatableResource
如果当前预调度的pod的request资源加上当前node上所有pod的request总和大于该node的可分配资源总量,则不允许调度到该节点,直接返回结果。其中request资源包括CPU、内存、storage。
allocatable := nodeInfo.AllocatableResource()
if allocatable.MilliCPU < podRequest.MilliCPU+nodeInfo.RequestedResource().MilliCPU {
predicateFails = append(predicateFails, NewInsufficientResourceError(v1.ResourceCPU, podRequest.MilliCPU, nodeInfo.RequestedResource().MilliCPU, allocatable.MilliCPU))
}
if allocatable.Memory < podRequest.Memory+nodeInfo.RequestedResource().Memory {
predicateFails = append(predicateFails, NewInsufficientResourceError(v1.ResourceMemory, podRequest.Memory, nodeInfo.RequestedResource().Memory, allocatable.Memory))
}
if allocatable.EphemeralStorage < podRequest.EphemeralStorage+nodeInfo.RequestedResource().EphemeralStorage {
predicateFails = append(predicateFails, NewInsufficientResourceError(v1.ResourceEphemeralStorage, podRequest.EphemeralStorage, nodeInfo.RequestedResource().EphemeralStorage, allocatable.EphemeralStorage))
}
6.6. ScalarResources
对于其他拓展的标量资源,判断该pod的request值加上当前node上所有pod对应资源的request总和是否大于该node上对应资源的可分配总量,如果是,则不允许调度到该节点。
for rName, rQuant := range podRequest.ScalarResources {
if v1helper.IsExtendedResourceName(rName) {
// If this resource is one of the extended resources that should be
// ignored, we will skip checking it.
if ignoredExtendedResources.Has(string(rName)) {
continue
}
}
if allocatable.ScalarResources[rName] < rQuant+nodeInfo.RequestedResource().ScalarResources[rName] {
predicateFails = append(predicateFails, NewInsufficientResourceError(rName, podRequest.ScalarResources[rName], nodeInfo.RequestedResource().ScalarResources[rName], allocatable.ScalarResources[rName]))
}
}
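下面用一个简化的示意程序代入具体数字(Resource只保留CPU与内存,数值均为假设),演示预选失败的核心判断:pod的request加上节点已占用的量是否超过可分配量。

```go
package main

import "fmt"

// resource是对上文schedulercache.Resource的简化,仅保留CPU与内存
type resource struct {
	MilliCPU int64
	Memory   int64
}

// fitsResources仿照PodFitsResources的核心判断:request加已占用是否超过可分配量
func fitsResources(podRequest, requested, allocatable resource) []string {
	var reasons []string
	if allocatable.MilliCPU < podRequest.MilliCPU+requested.MilliCPU {
		reasons = append(reasons, "Insufficient cpu")
	}
	if allocatable.Memory < podRequest.Memory+requested.Memory {
		reasons = append(reasons, "Insufficient memory")
	}
	return reasons
}

func main() {
	allocatable := resource{MilliCPU: 4000, Memory: 8 << 30} // 节点可分配:4核、8Gi
	requested := resource{MilliCPU: 3500, Memory: 6 << 30}   // 节点上已有pod的request总和
	podRequest := resource{MilliCPU: 1000, Memory: 1 << 30}  // 待调度pod的request

	reasons := fitsResources(podRequest, requested, allocatable)
	fmt.Println(len(reasons) == 0, reasons) // false [Insufficient cpu]:CPU超出,内存满足
}
```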
7. 总结
findNodesThatFit基于给定的预选函数过滤node,每个node传入到预选函数中来确定该节点是否符合要求。
findNodesThatFit的入参是被调度的pod和当前的节点列表,返回预选节点列表和错误。
findNodesThatFit基本流程如下:
- 设置可行节点的总数,作为预选节点数组的容量,避免总节点过多导致需要筛选的节点过多,效率低。
- 通过NodeTree不断获取下一个节点来判断该节点是否满足pod的调度条件。
- 通过之前注册的各种预选函数来判断当前节点是否符合pod的调度条件。
- 最后返回满足调度条件的node列表,供下一步的优选操作。
7.1. checkNode
checkNode是一个校验node是否符合要求的函数,其中实际调用到的核心函数是podFitsOnNode。再通过workqueue并发执行checkNode操作。
checkNode主要流程如下:
- 通过cache中的nodeTree不断获取下一个node。
- 将当前node和pod传入podFitsOnNode判断当前node是否符合要求。
- 如果当前node符合要求,就将当前node加入预选节点的数组filtered中。
- 如果当前node不满足要求,则加入到失败的数组中,并记录原因。
- 通过workqueue.ParallelizeUntil并发执行checkNode函数,一旦找到配置的可行节点数,就停止搜索更多节点。
7.2. podFitsOnNode
其中会调用到核心函数podFitsOnNode。
podFitsOnNode主要内容如下:
- podFitsOnNode会检查给定的某个Node是否满足预选函数。
- 对于给定的pod,podFitsOnNode会检查是否有相同的pod存在,尽量复用缓存过的预选结果。
podFitsOnNode主要在Schedule(调度)和Preempt(抢占)的时候被调用。
当在Schedule中被调用的时候,主要判断是否可以被调度到当前节点,依据为当前节点上所有已存在的pod及被提名要运行到该节点的具有相等或更高优先级的pod。
当在Preempt中被调用的时候,即发生抢占的时候,通过SelectVictimsOnNode函数选出需要被移除的pod,移除后再将预调度的pod调度到该节点上。
podFitsOnNode基本流程如下:
- 遍历之前注册好的预选策略predicates.Ordering,并获取预选策略的执行函数。
- 遍历执行每个预选函数,并返回是否合适、预选失败的原因和错误。
- 如果预选函数执行的结果不合适,则加入预选失败的数组中。
- 最后返回预选失败的个数是否为0,和预选失败的原因。
7.3. PodFitsResources
本文只示例分析了其中一个重要的预选函数:PodFitsResources
PodFitsResources用来检查一个节点是否有足够的资源来运行当前的pod,包括CPU、内存、GPU等。
PodFitsResources基本流程如下:
- 判断当前节点上pod总数加上预调度pod个数是否大于node的可分配pod总数,若是则不允许调度。
- 判断pod的request值是否都为0,若是则允许调度。
- 判断pod的request值加上当前node上所有pod的request值总和是否大于node的可分配资源,若是则不允许调度。
- 判断pod的拓展资源request值加上当前node上所有pod对应的request值总和是否大于node对应的可分配资源,若是则不允许调度。
11.4.2 - kube-scheduler源码分析(一)之 NewSchedulerCommand
以下代码分析基于kubernetes v1.12.0版本。
scheduler的cmd代码目录结构如下:
kube-scheduler
├── BUILD
├── OWNERS
├── app # app的目录下主要为运行scheduler相关的对象
│ ├── BUILD
│ ├── config
│ │ ├── BUILD
│ │ └── config.go # Scheduler的配置对象config
│ ├── options # options主要记录 Scheduler 使用到的参数
│ │ ├── BUILD
│ │ ├── configfile.go
│ │ ├── deprecated.go
│ │ ├── deprecated_test.go
│ │ ├── insecure_serving.go
│ │ ├── insecure_serving_test.go
│ │ ├── options.go # 主要包括Options、NewOptions、AddFlags、Config等函数
│ │ └── options_test.go
│ └── server.go # 主要包括 NewSchedulerCommand、NewSchedulerConfig、Run等函数
└── scheduler.go # main入口函数
1. Main函数
此部分的代码为/cmd/kube-scheduler/scheduler.go
kube-scheduler的入口函数Main函数,仍然是采用统一的代码风格,使用Cobra命令行框架。
func main() {
rand.Seed(time.Now().UTC().UnixNano())
command := app.NewSchedulerCommand()
// TODO: once we switch everything over to Cobra commands, we can go back to calling
// utilflag.InitFlags() (by removing its pflag.Parse() call). For now, we have to set the
// normalize func and add the go flag set by hand.
pflag.CommandLine.SetNormalizeFunc(utilflag.WordSepNormalizeFunc)
pflag.CommandLine.AddGoFlagSet(goflag.CommandLine)
// utilflag.InitFlags()
logs.InitLogs()
defer logs.FlushLogs()
if err := command.Execute(); err != nil {
fmt.Fprintf(os.Stderr, "%v\n", err)
os.Exit(1)
}
}
核心代码:
// 初始化scheduler命令结构体
command := app.NewSchedulerCommand()
// 执行Execute
err := command.Execute()
2. NewSchedulerCommand
此部分的代码为/cmd/kube-scheduler/app/server.go
NewSchedulerCommand主要用来构造和初始化SchedulerCommand结构体。
// NewSchedulerCommand creates a *cobra.Command object with default parameters
func NewSchedulerCommand() *cobra.Command {
opts, err := options.NewOptions()
if err != nil {
glog.Fatalf("unable to initialize command options: %v", err)
}
cmd := &cobra.Command{
Use: "kube-scheduler",
Long: `The Kubernetes scheduler is a policy-rich, topology-aware,
workload-specific function that significantly impacts availability, performance,
and capacity. The scheduler needs to take into account individual and collective
resource requirements, quality of service requirements, hardware/software/policy
constraints, affinity and anti-affinity specifications, data locality, inter-workload
interference, deadlines, and so on. Workload-specific requirements will be exposed
through the API as necessary.`,
Run: func(cmd *cobra.Command, args []string) {
verflag.PrintAndExitIfRequested()
utilflag.PrintFlags(cmd.Flags())
if len(args) != 0 {
fmt.Fprint(os.Stderr, "arguments are not supported\n")
}
if errs := opts.Validate(); len(errs) > 0 {
fmt.Fprintf(os.Stderr, "%v\n", utilerrors.NewAggregate(errs))
os.Exit(1)
}
if len(opts.WriteConfigTo) > 0 {
if err := options.WriteConfigFile(opts.WriteConfigTo, &opts.ComponentConfig); err != nil {
fmt.Fprintf(os.Stderr, "%v\n", err)
os.Exit(1)
}
glog.Infof("Wrote configuration to: %s\n", opts.WriteConfigTo)
return
}
c, err := opts.Config()
if err != nil {
fmt.Fprintf(os.Stderr, "%v\n", err)
os.Exit(1)
}
stopCh := make(chan struct{})
if err := Run(c.Complete(), stopCh); err != nil {
fmt.Fprintf(os.Stderr, "%v\n", err)
os.Exit(1)
}
},
}
opts.AddFlags(cmd.Flags())
cmd.MarkFlagFilename("config", "yaml", "yml", "json")
return cmd
}
核心代码:
// 构造option
opts, err := options.NewOptions()
// 初始化config对象
c, err := opts.Config()
// 执行run函数
err := Run(c.Complete(), stopCh)
// 添加参数
opts.AddFlags(cmd.Flags())
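下面给出一个基于Cobra的骨架示意(options、newCommand等命名均为假设,并非scheduler源码),演示构造Options、注册flag、校验参数并运行这一套路。

```go
package main

import (
	"fmt"
	"os"

	"github.com/spf13/cobra"
)

// options对应文中的Options结构:保存命令行参数,并据此生成运行所需的配置
type options struct {
	configFile string
}

func (o *options) validate() error { return nil }

func (o *options) run() error {
	fmt.Println("running with config:", o.configFile)
	return nil
}

// newCommand演示NewSchedulerCommand的骨架:构造options,校验后运行
func newCommand() *cobra.Command {
	o := &options{}
	cmd := &cobra.Command{
		Use: "demo-scheduler",
		Run: func(cmd *cobra.Command, args []string) {
			if err := o.validate(); err != nil {
				fmt.Fprintln(os.Stderr, err)
				os.Exit(1)
			}
			if err := o.run(); err != nil {
				fmt.Fprintln(os.Stderr, err)
				os.Exit(1)
			}
		},
	}
	// 注册flag,对应opts.AddFlags(cmd.Flags())
	cmd.Flags().StringVar(&o.configFile, "config", o.configFile, "The path to the configuration file.")
	return cmd
}

func main() {
	if err := newCommand().Execute(); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```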
2.1. NewOptions
NewOptions主要用来构造SchedulerServer使用的参数和上下文,其中核心参数是KubeSchedulerConfiguration。
opts, err := options.NewOptions()
NewOptions:
// NewOptions returns default scheduler app options.
func NewOptions() (*Options, error) {
cfg, err := newDefaultComponentConfig()
if err != nil {
return nil, err
}
hhost, hport, err := splitHostIntPort(cfg.HealthzBindAddress)
if err != nil {
return nil, err
}
o := &Options{
ComponentConfig: *cfg,
SecureServing: nil, // TODO: enable with apiserveroptions.NewSecureServingOptions()
CombinedInsecureServing: &CombinedInsecureServingOptions{
Healthz: &apiserveroptions.DeprecatedInsecureServingOptions{
BindNetwork: "tcp",
},
Metrics: &apiserveroptions.DeprecatedInsecureServingOptions{
BindNetwork: "tcp",
},
BindPort: hport,
BindAddress: hhost,
},
Authentication: nil, // TODO: enable with apiserveroptions.NewDelegatingAuthenticationOptions()
Authorization: nil, // TODO: enable with apiserveroptions.NewDelegatingAuthorizationOptions()
Deprecated: &DeprecatedOptions{
UseLegacyPolicyConfig: false,
PolicyConfigMapNamespace: metav1.NamespaceSystem,
},
}
return o, nil
}
2.2. Options.Config
Config初始化调度器的配置对象。
c, err := opts.Config()
Config函数主要执行以下操作:
- 构建scheduler client、leaderElectionClient、eventClient。
- 创建event recorder
- 设置leader选举
- 创建informer对象,主要函数有NewSharedInformerFactory和NewPodInformer。
Config具体代码如下:
// Config return a scheduler config object
func (o *Options) Config() (*schedulerappconfig.Config, error) {
c := &schedulerappconfig.Config{}
if err := o.ApplyTo(c); err != nil {
return nil, err
}
// prepare kube clients.
client, leaderElectionClient, eventClient, err := createClients(c.ComponentConfig.ClientConnection, o.Master, c.ComponentConfig.LeaderElection.RenewDeadline.Duration)
if err != nil {
return nil, err
}
// Prepare event clients.
eventBroadcaster := record.NewBroadcaster()
recorder := eventBroadcaster.NewRecorder(legacyscheme.Scheme, corev1.EventSource{Component: c.ComponentConfig.SchedulerName})
// Set up leader election if enabled.
var leaderElectionConfig *leaderelection.LeaderElectionConfig
if c.ComponentConfig.LeaderElection.LeaderElect {
leaderElectionConfig, err = makeLeaderElectionConfig(c.ComponentConfig.LeaderElection, leaderElectionClient, recorder)
if err != nil {
return nil, err
}
}
c.Client = client
c.InformerFactory = informers.NewSharedInformerFactory(client, 0)
c.PodInformer = factory.NewPodInformer(client, 0)
c.EventClient = eventClient
c.Recorder = recorder
c.Broadcaster = eventBroadcaster
c.LeaderElection = leaderElectionConfig
return c, nil
}
2.3. AddFlags
AddFlags为SchedulerServer添加指定的参数。
opts.AddFlags(cmd.Flags())
AddFlags函数的具体代码如下:
// AddFlags adds flags for the scheduler options.
func (o *Options) AddFlags(fs *pflag.FlagSet) {
fs.StringVar(&o.ConfigFile, "config", o.ConfigFile, "The path to the configuration file. Flags override values in this file.")
fs.StringVar(&o.WriteConfigTo, "write-config-to", o.WriteConfigTo, "If set, write the configuration values to this file and exit.")
fs.StringVar(&o.Master, "master", o.Master, "The address of the Kubernetes API server (overrides any value in kubeconfig)")
o.SecureServing.AddFlags(fs)
o.CombinedInsecureServing.AddFlags(fs)
o.Authentication.AddFlags(fs)
o.Authorization.AddFlags(fs)
o.Deprecated.AddFlags(fs, &o.ComponentConfig)
leaderelectionconfig.BindFlags(&o.ComponentConfig.LeaderElection.LeaderElectionConfiguration, fs)
utilfeature.DefaultFeatureGate.AddFlag(fs)
}
3. Run
此部分的代码为/cmd/kube-scheduler/app/server.go
err := Run(c.Complete(), stopCh)
Run运行一个不退出的常驻进程,来执行scheduler的相关操作。
Run函数的主要内容如下:
- 通过scheduler config来创建scheduler的结构体。
- 运行event broadcaster、healthz server、metrics server。
- 运行所有的informer并在调度前等待cache的同步(重点)。
- 执行sched.Run()来运行scheduler的调度逻辑。
- 如果有多个scheduler并开启了LeaderElect,则执行leader选举。
以下对重点代码分开分析:
3.1. NewSchedulerConfig
NewSchedulerConfig初始化SchedulerConfig(此部分具体逻辑待后续专门分析),最后初始化生成scheduler结构体。
// Build a scheduler config from the provided algorithm source.
schedulerConfig, err := NewSchedulerConfig(c)
if err != nil {
return err
}
// Create the scheduler.
sched := scheduler.NewFromConfig(schedulerConfig)
3.2. InformerFactory.Start
运行PodInformer,并运行InformerFactory。此部分的逻辑为client-go的informer机制,在Informer机制中有详细分析。
// Start all informers.
go c.PodInformer.Informer().Run(stopCh)
c.InformerFactory.Start(stopCh)
3.3. WaitForCacheSync
在调度前等待cache同步。
// Wait for all caches to sync before scheduling.
c.InformerFactory.WaitForCacheSync(stopCh)
controller.WaitForCacheSync("scheduler", stopCh, c.PodInformer.Informer().HasSynced)
3.3.1. InformerFactory.WaitForCacheSync
InformerFactory.WaitForCacheSync等待所有启动的informer的cache进行同步,保持本地的store信息与etcd的信息是最新一致的。
// WaitForCacheSync waits for all started informers' cache were synced.
func (f *sharedInformerFactory) WaitForCacheSync(stopCh <-chan struct{}) map[reflect.Type]bool {
informers := func() map[reflect.Type]cache.SharedIndexInformer {
f.lock.Lock()
defer f.lock.Unlock()
informers := map[reflect.Type]cache.SharedIndexInformer{}
for informerType, informer := range f.informers {
if f.startedInformers[informerType] {
informers[informerType] = informer
}
}
return informers
}()
res := map[reflect.Type]bool{}
for informType, informer := range informers {
res[informType] = cache.WaitForCacheSync(stopCh, informer.HasSynced)
}
return res
}
接着调用 cache.WaitForCacheSync。
// WaitForCacheSync waits for caches to populate. It returns true if it was successful, false
// if the controller should shutdown
func WaitForCacheSync(stopCh <-chan struct{}, cacheSyncs ...InformerSynced) bool {
err := wait.PollUntil(syncedPollPeriod,
func() (bool, error) {
for _, syncFunc := range cacheSyncs {
if !syncFunc() {
return false, nil
}
}
return true, nil
},
stopCh)
if err != nil {
glog.V(2).Infof("stop requested")
return false
}
glog.V(4).Infof("caches populated")
return true
}
3.3.2. controller.WaitForCacheSync
controller.WaitForCacheSync是对cache.WaitForCacheSync的一层封装,按controller的名字输出等待cache同步的开始与结果(成功或失败)日志。
controller.WaitForCacheSync("scheduler", stop, s.PodInformer.Informer().HasSynced)
controller.WaitForCacheSync具体代码如下:
// WaitForCacheSync is a wrapper around cache.WaitForCacheSync that generates log messages
// indicating that the controller identified by controllerName is waiting for syncs, followed by
// either a successful or failed sync.
func WaitForCacheSync(controllerName string, stopCh <-chan struct{}, cacheSyncs ...cache.InformerSynced) bool {
glog.Infof("Waiting for caches to sync for %s controller", controllerName)
if !cache.WaitForCacheSync(stopCh, cacheSyncs...) {
utilruntime.HandleError(fmt.Errorf("Unable to sync caches for %s controller", controllerName))
return false
}
glog.Infof("Caches are synced for %s controller", controllerName)
return true
}
3.4. LeaderElection
如果有多个scheduler,并开启leader选举,则运行LeaderElector直到选举结束或退出。
// If leader election is enabled, run via LeaderElector until done and exit.
if c.LeaderElection != nil {
c.LeaderElection.Callbacks = leaderelection.LeaderCallbacks{
OnStartedLeading: run,
OnStoppedLeading: func() {
utilruntime.HandleError(fmt.Errorf("lost master"))
},
}
leaderElector, err := leaderelection.NewLeaderElector(*c.LeaderElection)
if err != nil {
return fmt.Errorf("couldn't create leader elector: %v", err)
}
leaderElector.Run(ctx)
return fmt.Errorf("lost lease")
}
3.5. Scheduler.Run
// Prepare a reusable run function.
run := func(ctx context.Context) {
sched.Run()
<-ctx.Done()
}
ctx, cancel := context.WithCancel(context.TODO()) // TODO once Run() accepts a context, it should be used here
defer cancel()
go func() {
select {
case <-stopCh:
cancel()
case <-ctx.Done():
}
}()
...
run(ctx)
Scheduler.Run先等待cache同步,然后开启调度逻辑的goroutine。
Scheduler.Run的具体代码如下:
// Run begins watching and scheduling. It waits for cache to be synced, then starts a goroutine and returns immediately.
func (sched *Scheduler) Run() {
if !sched.config.WaitForCacheSync() {
return
}
go wait.Until(sched.scheduleOne, 0, sched.config.StopEverything)
}
以上是对/cmd/kube-scheduler/下启动部分代码的分析,Scheduler.Run后续的具体代码位于pkg/scheduler/scheduler.go,待后续文章分析。
参考:
11.4.3 - kube-scheduler源码分析(六)之 preempt
以下代码分析基于kubernetes v1.12.0版本。
本文主要分析调度中的抢占逻辑,当pod不适合任何节点的时候,可能pod会调度失败,这时候可能会发生抢占。抢占逻辑的具体实现函数为Scheduler.preempt。
1. 调用入口
当pod不适合任何节点的时候,可能pod会调度失败。这时候可能会发生抢占。
scheduleOne函数中关于抢占调用的逻辑如下:
此部分的代码位于/pkg/scheduler/scheduler.go
// scheduleOne does the entire scheduling workflow for a single pod. It is serialized on the scheduling algorithm's host fitting.
func (sched *Scheduler) scheduleOne() {
...
suggestedHost, err := sched.schedule(pod)
if err != nil {
// schedule() may have failed because the pod would not fit on any host, so we try to
// preempt, with the expectation that the next time the pod is tried for scheduling it
// will fit due to the preemption. It is also possible that a different pod will schedule
// into the resources that were preempted, but this is harmless.
if fitError, ok := err.(*core.FitError); ok {
preemptionStartTime := time.Now()
// 执行抢占逻辑
sched.preempt(pod, fitError)
metrics.PreemptionAttempts.Inc()
metrics.SchedulingAlgorithmPremptionEvaluationDuration.Observe(metrics.SinceInMicroseconds(preemptionStartTime))
metrics.SchedulingLatency.WithLabelValues(metrics.PreemptionEvaluation).Observe(metrics.SinceInSeconds(preemptionStartTime))
}
return
}
...
}
其中核心代码为:
// 基于sched.schedule(pod)返回的err和当前待调度的pod执行抢占策略
sched.preempt(pod, fitError)
2. Scheduler.preempt
当pod调度失败的时候,会抢占低优先级pod的空间来给高优先级的pod。其中入参为调度失败的pod对象和调度失败的err。
抢占的基本流程如下:
- 判断是否关闭了抢占机制(pod优先级特性未开启或调度器配置了DisablePreemption),如果关闭则直接返回。
- 获取调度失败pod的最新对象数据。
- 执行抢占算法Algorithm.Preempt,返回预调度节点和需要被剔除的pod列表。
- 将抢占算法返回的node添加到pod的Status.NominatedNodeName中,并删除需要被剔除的pod。
- 当抢占算法返回的node是nil的时候,清除pod的Status.NominatedNodeName信息。
整个抢占流程的最终结果实际上是更新Pod.Status.NominatedNodeName属性的信息。如果抢占算法返回的节点不为空,则将该node更新到Pod.Status.NominatedNodeName中,否则就将Pod.Status.NominatedNodeName设置为空。
2.1. preempt
preempt的具体实现函数:
此部分的代码位于/pkg/scheduler/scheduler.go
// preempt tries to create room for a pod that has failed to schedule, by preempting lower priority pods if possible.
// If it succeeds, it adds the name of the node where preemption has happened to the pod annotations.
// It returns the node name and an error if any.
func (sched *Scheduler) preempt(preemptor *v1.Pod, scheduleErr error) (string, error) {
if !util.PodPriorityEnabled() || sched.config.DisablePreemption {
glog.V(3).Infof("Pod priority feature is not enabled or preemption is disabled by scheduler configuration." +
" No preemption is performed.")
return "", nil
}
preemptor, err := sched.config.PodPreemptor.GetUpdatedPod(preemptor)
if err != nil {
glog.Errorf("Error getting the updated preemptor pod object: %v", err)
return "", err
}
node, victims, nominatedPodsToClear, err := sched.config.Algorithm.Preempt(preemptor, sched.config.NodeLister, scheduleErr)
metrics.PreemptionVictims.Set(float64(len(victims)))
if err != nil {
glog.Errorf("Error preempting victims to make room for %v/%v.", preemptor.Namespace, preemptor.Name)
return "", err
}
var nodeName = ""
if node != nil {
nodeName = node.Name
err = sched.config.PodPreemptor.SetNominatedNodeName(preemptor, nodeName)
if err != nil {
glog.Errorf("Error in preemption process. Cannot update pod %v/%v annotations: %v", preemptor.Namespace, preemptor.Name, err)
return "", err
}
for _, victim := range victims {
if err := sched.config.PodPreemptor.DeletePod(victim); err != nil {
glog.Errorf("Error preempting pod %v/%v: %v", victim.Namespace, victim.Name, err)
return "", err
}
sched.config.Recorder.Eventf(victim, v1.EventTypeNormal, "Preempted", "by %v/%v on node %v", preemptor.Namespace, preemptor.Name, nodeName)
}
}
// Clearing nominated pods should happen outside of "if node != nil". Node could
// be nil when a pod with nominated node name is eligible to preempt again,
// but preemption logic does not find any node for it. In that case Preempt()
// function of generic_scheduler.go returns the pod itself for removal of the annotation.
for _, p := range nominatedPodsToClear {
rErr := sched.config.PodPreemptor.RemoveNominatedNodeName(p)
if rErr != nil {
glog.Errorf("Cannot remove nominated node annotation of pod: %v", rErr)
// We do not return as this error is not critical.
}
}
return nodeName, err
}
以下对preempt的实现分段分析。
如果pod优先级特性未开启,或调度器配置中关闭了抢占机制,则直接返回。
if !util.PodPriorityEnabled() || sched.config.DisablePreemption {
glog.V(3).Infof("Pod priority feature is not enabled or preemption is disabled by scheduler configuration." +
" No preemption is performed.")
return "", nil
}
获取当前pod的最新状态。
preemptor, err := sched.config.PodPreemptor.GetUpdatedPod(preemptor)
if err != nil {
glog.Errorf("Error getting the updated preemptor pod object: %v", err)
return "", err
}
GetUpdatedPod的实现就是通过client从apiserver重新获取该pod的最新对象。
func (p *podPreemptor) GetUpdatedPod(pod *v1.Pod) (*v1.Pod, error) {
return p.Client.CoreV1().Pods(pod.Namespace).Get(pod.Name, metav1.GetOptions{})
}
接着执行抢占的算法。抢占的算法返回预调度节点的信息和因抢占被剔除的pod的信息。具体的抢占算法逻辑下文分析。
node, victims, nominatedPodsToClear, err := sched.config.Algorithm.Preempt(preemptor, sched.config.NodeLister, scheduleErr)
将预调度节点的信息更新到pod的Status.NominatedNodeName属性中。
err = sched.config.PodPreemptor.SetNominatedNodeName(preemptor, nodeName)
SetNominatedNodeName的具体实现为:
func (p *podPreemptor) SetNominatedNodeName(pod *v1.Pod, nominatedNodeName string) error {
podCopy := pod.DeepCopy()
podCopy.Status.NominatedNodeName = nominatedNodeName
_, err := p.Client.CoreV1().Pods(pod.Namespace).UpdateStatus(podCopy)
return err
}
接着删除因抢占而需要被剔除的pod。
err := sched.config.PodPreemptor.DeletePod(victim)
PodPreemptor.DeletePod的具体实现就是删除具体的pod。
func (p *podPreemptor) DeletePod(pod *v1.Pod) error {
return p.Client.CoreV1().Pods(pod.Namespace).Delete(pod.Name, &metav1.DeleteOptions{})
}
最后,对抢占算法返回的nominatedPodsToClear列表中的pod,清除其Status.NominatedNodeName属性。当抢占算法没有为当前pod找到合适的节点时,该列表中会包含当前pod自身,相当于把它的Status.NominatedNodeName置空。
// Clearing nominated pods should happen outside of "if node != nil". Node could
// be nil when a pod with nominated node name is eligible to preempt again,
// but preemption logic does not find any node for it. In that case Preempt()
// function of generic_scheduler.go returns the pod itself for removal of the annotation.
for _, p := range nominatedPodsToClear {
rErr := sched.config.PodPreemptor.RemoveNominatedNodeName(p)
if rErr != nil {
glog.Errorf("Cannot remove nominated node annotation of pod: %v", rErr)
// We do not return as this error is not critical.
}
}
RemoveNominatedNodeName的具体实现如下:
func (p *podPreemptor) RemoveNominatedNodeName(pod *v1.Pod) error {
if len(pod.Status.NominatedNodeName) == 0 {
return nil
}
return p.SetNominatedNodeName(pod, "")
}
2.2. NominatedNodeName
Pod.Status.NominatedNodeName的说明:
nominatedNodeName记录的是调度失败的pod发起抢占后,预期被调度到的节点(即发生抢占、牺牲者pod所在的节点)。在被抢占的pod真正被剔除之前,该调度失败的pod不会被调度;同时也不保证它最终一定会调度到nominatedNodeName指定的节点上,之后也可能因为其他节点资源变充足等原因调度到别处。该pod最终会被重新加入调度队列。
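排查抢占相关问题时,可以直接查看Pending pod的status.nominatedNodeName字段。下面是一个用client-go读取该字段的最小示意(printNominatedNode为本文虚构的演示函数,Get签名按v1.12时期的client-go风格、不带context):

```go
import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// printNominatedNode 打印指定pod当前的nominatedNodeName(示意)
func printNominatedNode(client kubernetes.Interface, namespace, name string) error {
	pod, err := client.CoreV1().Pods(namespace).Get(name, metav1.GetOptions{})
	if err != nil {
		return err
	}
	fmt.Printf("pod %s/%s nominatedNodeName=%q\n", namespace, name, pod.Status.NominatedNodeName)
	return nil
}
```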
其中加入到调度队列的具体过程如下:
func NewConfigFactory(args *ConfigFactoryArgs) scheduler.Configurator {
...
// unscheduled pod queue
args.PodInformer.Informer().AddEventHandler(
...
Handler: cache.ResourceEventHandlerFuncs{
AddFunc: c.addPodToSchedulingQueue,
UpdateFunc: c.updatePodInSchedulingQueue,
DeleteFunc: c.deletePodFromSchedulingQueue,
},
},
)
...
}
addPodToSchedulingQueue:
func (c *configFactory) addPodToSchedulingQueue(obj interface{}) {
if err := c.podQueue.Add(obj.(*v1.Pod)); err != nil {
runtime.HandleError(fmt.Errorf("unable to queue %T: %v", obj, err))
}
}
PriorityQueue.Add:
// Add adds a pod to the active queue. It should be called only when a new pod
// is added so there is no chance the pod is already in either queue.
func (p *PriorityQueue) Add(pod *v1.Pod) error {
p.lock.Lock()
defer p.lock.Unlock()
err := p.activeQ.Add(pod)
if err != nil {
glog.Errorf("Error adding pod %v/%v to the scheduling queue: %v", pod.Namespace, pod.Name, err)
} else {
if p.unschedulableQ.get(pod) != nil {
glog.Errorf("Error: pod %v/%v is already in the unschedulable queue.", pod.Namespace, pod.Name)
p.deleteNominatedPodIfExists(pod)
p.unschedulableQ.delete(pod)
}
p.addNominatedPodIfNeeded(pod)
p.cond.Broadcast()
}
return err
}
addNominatedPodIfNeeded:
// addNominatedPodIfNeeded adds a pod to nominatedPods if it has a NominatedNodeName and it does not
// already exist in the map. Adding an existing pod is not going to update the pod.
func (p *PriorityQueue) addNominatedPodIfNeeded(pod *v1.Pod) {
nnn := NominatedNodeName(pod)
if len(nnn) > 0 {
for _, np := range p.nominatedPods[nnn] {
if np.UID == pod.UID {
glog.Errorf("Pod %v/%v already exists in the nominated map!", pod.Namespace, pod.Name)
return
}
}
p.nominatedPods[nnn] = append(p.nominatedPods[nnn], pod)
}
}
NominatedNodeName:
// NominatedNodeName returns nominated node name of a Pod.
func NominatedNodeName(pod *v1.Pod) string {
return pod.Status.NominatedNodeName
}
3. genericScheduler.Preempt
抢占算法依然是在ScheduleAlgorithm接口中定义。
// ScheduleAlgorithm is an interface implemented by things that know how to schedule pods
// onto machines.
type ScheduleAlgorithm interface {
Schedule(*v1.Pod, NodeLister) (selectedMachine string, err error)
// Preempt receives scheduling errors for a pod and tries to create room for
// the pod by preempting lower priority pods if possible.
// It returns the node where preemption happened, a list of preempted pods, a
// list of pods whose nominated node name should be removed, and error if any.
Preempt(*v1.Pod, NodeLister, error) (selectedNode *v1.Node, preemptedPods []*v1.Pod, cleanupNominatedPods []*v1.Pod, err error)
// Predicates() returns a pointer to a map of predicate functions. This is
// exposed for testing.
Predicates() map[string]FitPredicate
// Prioritizers returns a slice of priority config. This is exposed for
// testing.
Prioritizers() []PriorityConfig
}
Preempt的具体实现为genericScheduler结构体。
Preempt的主要实现是找到可以调度的节点和上面因抢占而需要被剔除的pod。
基本流程如下:
- 根据调度失败的原因对所有节点先进行一批筛选,筛选出潜在的被调度节点列表。
- 通过selectNodesForPreemption筛选出各候选节点及其上需要牺牲的pod。
- 基于拓展(extender)的抢占逻辑,再次对上述筛选出来的候选节点和牺牲者做过滤。
- 基于上述的过滤结果,选择一个最终可能因抢占被调度的节点。
- 基于上述的候选节点,找出该节点上优先级低于当前待调度pod、需要清除nominated信息的pod列表。
完整代码如下:
此部分代码位于pkg/scheduler/core/generic_scheduler.go
// preempt finds nodes with pods that can be preempted to make room for "pod" to
// schedule. It chooses one of the nodes and preempts the pods on the node and
// returns 1) the node, 2) the list of preempted pods if such a node is found,
// 3) A list of pods whose nominated node name should be cleared, and 4) any
// possible error.
func (g *genericScheduler) Preempt(pod *v1.Pod, nodeLister algorithm.NodeLister, scheduleErr error) (*v1.Node, []*v1.Pod, []*v1.Pod, error) {
// Scheduler may return various types of errors. Consider preemption only if
// the error is of type FitError.
fitError, ok := scheduleErr.(*FitError)
if !ok || fitError == nil {
return nil, nil, nil, nil
}
err := g.cache.UpdateNodeNameToInfoMap(g.cachedNodeInfoMap)
if err != nil {
return nil, nil, nil, err
}
if !podEligibleToPreemptOthers(pod, g.cachedNodeInfoMap) {
glog.V(5).Infof("Pod %v/%v is not eligible for more preemption.", pod.Namespace, pod.Name)
return nil, nil, nil, nil
}
allNodes, err := nodeLister.List()
if err != nil {
return nil, nil, nil, err
}
if len(allNodes) == 0 {
return nil, nil, nil, ErrNoNodesAvailable
}
potentialNodes := nodesWherePreemptionMightHelp(allNodes, fitError.FailedPredicates)
if len(potentialNodes) == 0 {
glog.V(3).Infof("Preemption will not help schedule pod %v/%v on any node.", pod.Namespace, pod.Name)
// In this case, we should clean-up any existing nominated node name of the pod.
return nil, nil, []*v1.Pod{pod}, nil
}
pdbs, err := g.cache.ListPDBs(labels.Everything())
if err != nil {
return nil, nil, nil, err
}
// 找出可能被抢占的节点
nodeToVictims, err := selectNodesForPreemption(pod, g.cachedNodeInfoMap, potentialNodes, g.predicates,
g.predicateMetaProducer, g.schedulingQueue, pdbs)
if err != nil {
return nil, nil, nil, err
}
// We will only check nodeToVictims with extenders that support preemption.
// Extenders which do not support preemption may later prevent preemptor from being scheduled on the nominated
// node. In that case, scheduler will find a different host for the preemptor in subsequent scheduling cycles.
nodeToVictims, err = g.processPreemptionWithExtenders(pod, nodeToVictims)
if err != nil {
return nil, nil, nil, err
}
// 选出最终被抢占的节点
candidateNode := pickOneNodeForPreemption(nodeToVictims)
if candidateNode == nil {
return nil, nil, nil, err
}
// Lower priority pods nominated to run on this node, may no longer fit on
// this node. So, we should remove their nomination. Removing their
// nomination updates these pods and moves them to the active queue. It
// lets scheduler find another place for them.
// 找出候选节点上优先级更低、需要清除nominated信息的pod列表
nominatedPods := g.getLowerPriorityNominatedPods(pod, candidateNode.Name)
if nodeInfo, ok := g.cachedNodeInfoMap[candidateNode.Name]; ok {
return nodeInfo.Node(), nodeToVictims[candidateNode].Pods, nominatedPods, err
}
return nil, nil, nil, fmt.Errorf(
"preemption failed: the target node %s has been deleted from scheduler cache",
candidateNode.Name)
}
以下对genericScheduler.Preempt分段进行分析。
3.1. selectNodesForPreemption
selectNodesForPreemption并行地从所有潜在节点中找出可以通过抢占腾出空间的节点。
nodeToVictims, err := selectNodesForPreemption(pod, g.cachedNodeInfoMap, potentialNodes, g.predicates,g.predicateMetaProducer, g.schedulingQueue, pdbs)
selectNodesForPreemption主要基于selectVictimsOnNode构造一个checkNode的函数,然后并发执行该函数。
selectNodesForPreemption具体实现如下:
// selectNodesForPreemption finds all the nodes with possible victims for
// preemption in parallel.
func selectNodesForPreemption(pod *v1.Pod,
nodeNameToInfo map[string]*schedulercache.NodeInfo,
potentialNodes []*v1.Node,
predicates map[string]algorithm.FitPredicate,
metadataProducer algorithm.PredicateMetadataProducer,
queue SchedulingQueue,
pdbs []*policy.PodDisruptionBudget,
) (map[*v1.Node]*schedulerapi.Victims, error) {
nodeToVictims := map[*v1.Node]*schedulerapi.Victims{}
var resultLock sync.Mutex
// We can use the same metadata producer for all nodes.
meta := metadataProducer(pod, nodeNameToInfo)
checkNode := func(i int) {
nodeName := potentialNodes[i].Name
var metaCopy algorithm.PredicateMetadata
if meta != nil {
metaCopy = meta.ShallowCopy()
}
pods, numPDBViolations, fits := selectVictimsOnNode(pod, metaCopy, nodeNameToInfo[nodeName], predicates, queue, pdbs)
if fits {
resultLock.Lock()
victims := schedulerapi.Victims{
Pods: pods,
NumPDBViolations: numPDBViolations,
}
nodeToVictims[potentialNodes[i]] = &victims
resultLock.Unlock()
}
}
workqueue.Parallelize(16, len(potentialNodes), checkNode)
return nodeToVictims, nil
}
3.1.1. selectVictimsOnNode
selectVictimsOnNode在给定节点上找到应该被抢占的最小pod集合,以便为调度失败的pod腾出足够的空间。当可以选择更低优先级的pod时,较高优先级的pod不会被选入待剔除集合。该函数最终返回牺牲者pod列表、其中违反PodDisruptionBudget的pod数量,以及该节点是否适合抢占的布尔值。
基本流程如下:
- 先将该节点上所有优先级低于待调度pod的pod(从节点信息的副本中)移除,检查此时待调度pod能否调度到当前节点上。
- 如果可以,则将这些低优先级pod按优先级排序,并分成违反PDB和不违反PDB两组,依次尝试"赦免"(重新加回)尽可能多的pod,前提是加回后待调度pod仍然能够调度成功;最终未被赦免的pod即为牺牲者。
// selectVictimsOnNode finds minimum set of pods on the given node that should
// be preempted in order to make enough room for "pod" to be scheduled. The
// minimum set selected is subject to the constraint that a higher-priority pod
// is never preempted when a lower-priority pod could be (higher/lower relative
// to one another, not relative to the preemptor "pod").
// The algorithm first checks if the pod can be scheduled on the node when all the
// lower priority pods are gone. If so, it sorts all the lower priority pods by
// their priority and then puts them into two groups of those whose PodDisruptionBudget
// will be violated if preempted and other non-violating pods. Both groups are
// sorted by priority. It first tries to reprieve as many PDB violating pods as
// possible and then does them same for non-PDB-violating pods while checking
// that the "pod" can still fit on the node.
// NOTE: This function assumes that it is never called if "pod" cannot be scheduled
// due to pod affinity, node affinity, or node anti-affinity reasons. None of
// these predicates can be satisfied by removing more pods from the node.
func selectVictimsOnNode(
pod *v1.Pod,
meta algorithm.PredicateMetadata,
nodeInfo *schedulercache.NodeInfo,
fitPredicates map[string]algorithm.FitPredicate,
queue SchedulingQueue,
pdbs []*policy.PodDisruptionBudget,
) ([]*v1.Pod, int, bool) {
potentialVictims := util.SortableList{CompFunc: util.HigherPriorityPod}
nodeInfoCopy := nodeInfo.Clone()
removePod := func(rp *v1.Pod) {
nodeInfoCopy.RemovePod(rp)
if meta != nil {
meta.RemovePod(rp)
}
}
addPod := func(ap *v1.Pod) {
nodeInfoCopy.AddPod(ap)
if meta != nil {
meta.AddPod(ap, nodeInfoCopy)
}
}
// As the first step, remove all the lower priority pods from the node and
// check if the given pod can be scheduled.
podPriority := util.GetPodPriority(pod)
for _, p := range nodeInfoCopy.Pods() {
if util.GetPodPriority(p) < podPriority {
potentialVictims.Items = append(potentialVictims.Items, p)
removePod(p)
}
}
potentialVictims.Sort()
// If the new pod does not fit after removing all the lower priority pods,
// we are almost done and this node is not suitable for preemption. The only condition
// that we should check is if the "pod" is failing to schedule due to pod affinity
// failure.
// TODO(bsalamat): Consider checking affinity to lower priority pods if feasible with reasonable performance.
if fits, _, err := podFitsOnNode(pod, meta, nodeInfoCopy, fitPredicates, nil, nil, queue, false, nil); !fits {
if err != nil {
glog.Warningf("Encountered error while selecting victims on node %v: %v", nodeInfo.Node().Name, err)
}
return nil, 0, false
}
var victims []*v1.Pod
numViolatingVictim := 0
// Try to reprieve as many pods as possible. We first try to reprieve the PDB
// violating victims and then other non-violating ones. In both cases, we start
// from the highest priority victims.
violatingVictims, nonViolatingVictims := filterPodsWithPDBViolation(potentialVictims.Items, pdbs)
reprievePod := func(p *v1.Pod) bool {
addPod(p)
fits, _, _ := podFitsOnNode(pod, meta, nodeInfoCopy, fitPredicates, nil, nil, queue, false, nil)
if !fits {
removePod(p)
victims = append(victims, p)
glog.V(5).Infof("Pod %v is a potential preemption victim on node %v.", p.Name, nodeInfo.Node().Name)
}
return fits
}
for _, p := range violatingVictims {
if !reprievePod(p) {
numViolatingVictim++
}
}
// Now we try to reprieve non-violating victims.
for _, p := range nonViolatingVictims {
reprievePod(p)
}
return victims, numViolatingVictim, true
}
3.2. processPreemptionWithExtenders
processPreemptionWithExtenders在selectNodesForPreemption筛选结果的基础上,调用支持抢占的extender继续过滤候选节点及其牺牲者。
// We will only check nodeToVictims with extenders that support preemption.
// Extenders which do not support preemption may later prevent preemptor from being scheduled on the nominated
// node. In that case, scheduler will find a different host for the preemptor in subsequent scheduling cycles.
nodeToVictims, err = g.processPreemptionWithExtenders(pod, nodeToVictims)
if err != nil {
return nil, nil, nil, err
}
processPreemptionWithExtenders完整代码如下:
// processPreemptionWithExtenders processes preemption with extenders
func (g *genericScheduler) processPreemptionWithExtenders(
pod *v1.Pod,
nodeToVictims map[*v1.Node]*schedulerapi.Victims,
) (map[*v1.Node]*schedulerapi.Victims, error) {
if len(nodeToVictims) > 0 {
for _, extender := range g.extenders {
if extender.SupportsPreemption() && extender.IsInterested(pod) {
newNodeToVictims, err := extender.ProcessPreemption(
pod,
nodeToVictims,
g.cachedNodeInfoMap,
)
if err != nil {
if extender.IsIgnorable() {
glog.Warningf("Skipping extender %v as it returned error %v and has ignorable flag set",
extender, err)
continue
}
return nil, err
}
// Replace nodeToVictims with new result after preemption. So the
// rest of extenders can continue use it as parameter.
nodeToVictims = newNodeToVictims
// If node list becomes empty, no preemption can happen regardless of other extenders.
if len(nodeToVictims) == 0 {
break
}
}
}
}
return nodeToVictims, nil
}
3.3. pickOneNodeForPreemption
pickOneNodeForPreemption从筛选出的节点中再挑选一个作为最终的抢占候选节点,依次比较:PDB违反数量最少、牺牲者中最高优先级最低、牺牲者优先级之和最小、牺牲者数量最少,仍然相同则取第一个。
candidateNode := pickOneNodeForPreemption(nodeToVictims)
if candidateNode == nil {
return nil, nil, nil, err
}
pickOneNodeForPreemption完整代码如下:
// pickOneNodeForPreemption chooses one node among the given nodes. It assumes
// pods in each map entry are ordered by decreasing priority.
// It picks a node based on the following criteria:
// 1. A node with minimum number of PDB violations.
// 2. A node with minimum highest priority victim is picked.
// 3. Ties are broken by sum of priorities of all victims.
// 4. If there are still ties, node with the minimum number of victims is picked.
// 5. If there are still ties, the first such node is picked (sort of randomly).
// The 'minNodes1' and 'minNodes2' are being reused here to save the memory
// allocation and garbage collection time.
func pickOneNodeForPreemption(nodesToVictims map[*v1.Node]*schedulerapi.Victims) *v1.Node {
if len(nodesToVictims) == 0 {
return nil
}
minNumPDBViolatingPods := math.MaxInt32
var minNodes1 []*v1.Node
lenNodes1 := 0
for node, victims := range nodesToVictims {
if len(victims.Pods) == 0 {
// We found a node that doesn't need any preemption. Return it!
// This should happen rarely when one or more pods are terminated between
// the time that scheduler tries to schedule the pod and the time that
// preemption logic tries to find nodes for preemption.
return node
}
numPDBViolatingPods := victims.NumPDBViolations
if numPDBViolatingPods < minNumPDBViolatingPods {
minNumPDBViolatingPods = numPDBViolatingPods
minNodes1 = nil
lenNodes1 = 0
}
if numPDBViolatingPods == minNumPDBViolatingPods {
minNodes1 = append(minNodes1, node)
lenNodes1++
}
}
if lenNodes1 == 1 {
return minNodes1[0]
}
// There are more than one node with minimum number PDB violating pods. Find
// the one with minimum highest priority victim.
minHighestPriority := int32(math.MaxInt32)
var minNodes2 = make([]*v1.Node, lenNodes1)
lenNodes2 := 0
for i := 0; i < lenNodes1; i++ {
node := minNodes1[i]
victims := nodesToVictims[node]
// highestPodPriority is the highest priority among the victims on this node.
highestPodPriority := util.GetPodPriority(victims.Pods[0])
if highestPodPriority < minHighestPriority {
minHighestPriority = highestPodPriority
lenNodes2 = 0
}
if highestPodPriority == minHighestPriority {
minNodes2[lenNodes2] = node
lenNodes2++
}
}
if lenNodes2 == 1 {
return minNodes2[0]
}
// There are a few nodes with minimum highest priority victim. Find the
// smallest sum of priorities.
minSumPriorities := int64(math.MaxInt64)
lenNodes1 = 0
for i := 0; i < lenNodes2; i++ {
var sumPriorities int64
node := minNodes2[i]
for _, pod := range nodesToVictims[node].Pods {
// We add MaxInt32+1 to all priorities to make all of them >= 0. This is
// needed so that a node with a few pods with negative priority is not
// picked over a node with a smaller number of pods with the same negative
// priority (and similar scenarios).
sumPriorities += int64(util.GetPodPriority(pod)) + int64(math.MaxInt32+1)
}
if sumPriorities < minSumPriorities {
minSumPriorities = sumPriorities
lenNodes1 = 0
}
if sumPriorities == minSumPriorities {
minNodes1[lenNodes1] = node
lenNodes1++
}
}
if lenNodes1 == 1 {
return minNodes1[0]
}
// There are a few nodes with minimum highest priority victim and sum of priorities.
// Find one with the minimum number of pods.
minNumPods := math.MaxInt32
lenNodes2 = 0
for i := 0; i < lenNodes1; i++ {
node := minNodes1[i]
numPods := len(nodesToVictims[node].Pods)
if numPods < minNumPods {
minNumPods = numPods
lenNodes2 = 0
}
if numPods == minNumPods {
minNodes2[lenNodes2] = node
lenNodes2++
}
}
// At this point, even if there are more than one node with the same score,
// return the first one.
if lenNodes2 > 0 {
return minNodes2[0]
}
glog.Errorf("Error in logic of node scoring for preemption. We should never reach here!")
return nil
}
3.4. getLowerPriorityNominatedPods
getLowerPriorityNominatedPods的基本流程如下:
- 通过schedulingQueue.WaitingPodsForNode获取nominated到该候选节点上的pod列表。
- 获取待调度pod的优先级值。
- 遍历上述pod列表,将优先级低于待调度pod的放入低优先级pod列表中并返回。
genericScheduler.Preempt中相关代码如下:
// Lower priority pods nominated to run on this node, may no longer fit on
// this node. So, we should remove their nomination. Removing their
// nomination updates these pods and moves them to the active queue. It
// lets scheduler find another place for them.
nominatedPods := g.getLowerPriorityNominatedPods(pod, candidateNode.Name)
if nodeInfo, ok := g.cachedNodeInfoMap[candidateNode.Name]; ok {
return nodeInfo.Node(), nodeToVictims[candidateNode].Pods, nominatedPods, err
}
getLowerPriorityNominatedPods代码如下:
此部分代码位于pkg/scheduler/core/generic_scheduler.go
// getLowerPriorityNominatedPods returns pods whose priority is smaller than the
// priority of the given "pod" and are nominated to run on the given node.
// Note: We could possibly check if the nominated lower priority pods still fit
// and return those that no longer fit, but that would require lots of
// manipulation of NodeInfo and PredicateMeta per nominated pod. It may not be
// worth the complexity, especially because we generally expect to have a very
// small number of nominated pods per node.
func (g *genericScheduler) getLowerPriorityNominatedPods(pod *v1.Pod, nodeName string) []*v1.Pod {
pods := g.schedulingQueue.WaitingPodsForNode(nodeName)
if len(pods) == 0 {
return nil
}
var lowerPriorityPods []*v1.Pod
podPriority := util.GetPodPriority(pod)
for _, p := range pods {
if util.GetPodPriority(p) < podPriority {
lowerPriorityPods = append(lowerPriorityPods, p)
}
}
return lowerPriorityPods
}
4. 总结
4.1. Scheduler.preempt
当pod调度失败的时候,会抢占低优先级pod的空间来给高优先级的pod。其中入参为调度失败的pod对象和调度失败的err。
抢占的基本流程如下:
- 判断是否关闭了抢占机制(pod优先级特性未开启或调度器配置了DisablePreemption),如果关闭则直接返回。
- 获取调度失败pod的最新对象数据。
- 执行抢占算法Algorithm.Preempt,返回预调度节点和需要被剔除的pod列表。
- 将抢占算法返回的node添加到pod的Status.NominatedNodeName中,并删除需要被剔除的pod。
- 当抢占算法返回的node是nil的时候,清除pod的Status.NominatedNodeName信息。
整个抢占流程的最终结果实际上是更新Pod.Status.NominatedNodeName属性的信息。如果抢占算法返回的节点不为空,则将该node更新到Pod.Status.NominatedNodeName中,否则就将Pod.Status.NominatedNodeName设置为空。
4.2. genericScheduler.Preempt
Preempt的主要实现是找到可以调度的节点和上面因抢占而需要被剔除的pod。
基本流程如下:
- 根据调度失败的原因对所有节点先进行一批筛选,筛选出潜在的被调度节点列表。
- 通过selectNodesForPreemption筛选出各候选节点及其上需要牺牲的pod。
- 基于拓展(extender)的抢占逻辑,再次对上述筛选出来的候选节点和牺牲者做过滤。
- 基于上述的过滤结果,选择一个最终可能因抢占被调度的节点。
- 基于上述的候选节点,找出该节点上优先级低于当前待调度pod、需要清除nominated信息的pod列表。
参考:
11.4.4 - kube-scheduler源码分析(五)之 PrioritizeNodes
以下代码分析基于kubernetes v1.12.0版本。
本文主要分析优选策略逻辑,即从通过预选的节点中选择出最优的节点。优选策略的具体实现函数为PrioritizeNodes,其最终返回的是一个记录了各个节点分数的列表。
1. 调用入口
genericScheduler.Schedule中对PrioritizeNodes的调用过程如下:
此部分代码位于pkg/scheduler/core/generic_scheduler.go
func (g *genericScheduler) Schedule(pod *v1.Pod, nodeLister algorithm.NodeLister) (string, error) {
...
trace.Step("Prioritizing")
startPriorityEvalTime := time.Now()
// When only one node after predicate, just use it.
if len(filteredNodes) == 1 {
metrics.SchedulingAlgorithmPriorityEvaluationDuration.Observe(metrics.SinceInMicroseconds(startPriorityEvalTime))
return filteredNodes[0].Name, nil
}
metaPrioritiesInterface := g.priorityMetaProducer(pod, g.cachedNodeInfoMap)
// 执行优选逻辑的操作,返回记录各个节点分数的列表
priorityList, err := PrioritizeNodes(pod, g.cachedNodeInfoMap, metaPrioritiesInterface, g.prioritizers, filteredNodes, g.extenders)
if err != nil {
return "", err
}
metrics.SchedulingAlgorithmPriorityEvaluationDuration.Observe(metrics.SinceInMicroseconds(startPriorityEvalTime))
metrics.SchedulingLatency.WithLabelValues(metrics.PriorityEvaluation).Observe(metrics.SinceInSeconds(startPriorityEvalTime))
...
}
核心代码:
// 基于预选节点filteredNodes进一步筛选优选的节点,返回记录各个节点分数的列表
priorityList, err := PrioritizeNodes(pod, g.cachedNodeInfoMap, metaPrioritiesInterface, g.prioritizers, filteredNodes, g.extenders)
2. PrioritizeNodes
优选,即从通过预选的节点中选择出最优的节点。PrioritizeNodes最终返回的是一个记录了各个节点分数的列表。
具体操作如下:
- PrioritizeNodes通过并行运行各个优先级函数来对节点进行优先级排序。
- 每个优先级函数会给节点打分,打分范围为0-10分。
- 0 表示优先级最低的节点,10表示优先级最高的节点。
- 每个优先级函数也有各自的权重。
- 优先级函数返回的节点分数乘以权重以获得加权分数。
- 最后将所有加权分数相加,得到每个节点的总加权分数(见下面的数字示例)。
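用一个数字例子说明加权求和(优选函数名、分数与权重均为假设值):假设NodeA在函数F1上得8分(权重1)、在F2上得5分(权重2),则NodeA的总分为8*1 + 5*2 = 18。下面的小函数演示这一计算:

```go
// weightedScore 表示某个优选函数给节点的得分及该函数的权重(示意)
type weightedScore struct {
	score  int
	weight int
}

// totalScore 对单个节点做加权求和
func totalScore(perPriority []weightedScore) int {
	total := 0
	for _, s := range perPriority {
		total += s.score * s.weight
	}
	return total
}

// totalScore([]weightedScore{{8, 1}, {5, 2}}) == 18
```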
PrioritizeNodes主要流程如下:
- 如果没有设置优选函数和拓展函数,则全部节点设置相同的分数,直接返回。
- 依次给node执行map函数进行打分。
- 再对上述map函数的执行结果执行reduce函数计算最终得分。
- 最后按各优先级函数的权重对得分做加权求和,得到每个节点的总分。
入参:
- pod
- nodeNameToInfo
- meta interface{},
- priorityConfigs
- nodes
- extenders
出参:
- HostPriorityList:记录节点分数的列表。
HostPriority定义如下:
// HostPriority represents the priority of scheduling to a particular host, higher priority is better.
type HostPriority struct {
// Name of the host
Host string
// Score associated with the host
Score int
}
PrioritizeNodes完整代码如下:
此部分代码位于pkg/scheduler/core/generic_scheduler.go
// PrioritizeNodes prioritizes the nodes by running the individual priority functions in parallel.
// Each priority function is expected to set a score of 0-10
// 0 is the lowest priority score (least preferred node) and 10 is the highest
// Each priority function can also have its own weight
// The node scores returned by the priority function are multiplied by the weights to get weighted scores
// All scores are finally combined (added) to get the total weighted scores of all nodes
func PrioritizeNodes(
pod *v1.Pod,
nodeNameToInfo map[string]*schedulercache.NodeInfo,
meta interface{},
priorityConfigs []algorithm.PriorityConfig,
nodes []*v1.Node,
extenders []algorithm.SchedulerExtender,
) (schedulerapi.HostPriorityList, error) {
// If no priority configs are provided, then the EqualPriority function is applied
// This is required to generate the priority list in the required format
if len(priorityConfigs) == 0 && len(extenders) == 0 {
result := make(schedulerapi.HostPriorityList, 0, len(nodes))
for i := range nodes {
hostPriority, err := EqualPriorityMap(pod, meta, nodeNameToInfo[nodes[i].Name])
if err != nil {
return nil, err
}
result = append(result, hostPriority)
}
return result, nil
}
var (
mu = sync.Mutex{}
wg = sync.WaitGroup{}
errs []error
)
appendError := func(err error) {
mu.Lock()
defer mu.Unlock()
errs = append(errs, err)
}
results := make([]schedulerapi.HostPriorityList, len(priorityConfigs), len(priorityConfigs))
for i, priorityConfig := range priorityConfigs {
if priorityConfig.Function != nil {
// DEPRECATED
wg.Add(1)
go func(index int, config algorithm.PriorityConfig) {
defer wg.Done()
var err error
results[index], err = config.Function(pod, nodeNameToInfo, nodes)
if err != nil {
appendError(err)
}
}(i, priorityConfig)
} else {
results[i] = make(schedulerapi.HostPriorityList, len(nodes))
}
}
processNode := func(index int) {
nodeInfo := nodeNameToInfo[nodes[index].Name]
var err error
for i := range priorityConfigs {
if priorityConfigs[i].Function != nil {
continue
}
results[i][index], err = priorityConfigs[i].Map(pod, meta, nodeInfo)
if err != nil {
appendError(err)
return
}
}
}
workqueue.Parallelize(16, len(nodes), processNode)
for i, priorityConfig := range priorityConfigs {
if priorityConfig.Reduce == nil {
continue
}
wg.Add(1)
go func(index int, config algorithm.PriorityConfig) {
defer wg.Done()
if err := config.Reduce(pod, meta, nodeNameToInfo, results[index]); err != nil {
appendError(err)
}
if glog.V(10) {
for _, hostPriority := range results[index] {
glog.Infof("%v -> %v: %v, Score: (%d)", pod.Name, hostPriority.Host, config.Name, hostPriority.Score)
}
}
}(i, priorityConfig)
}
// Wait for all computations to be finished.
wg.Wait()
if len(errs) != 0 {
return schedulerapi.HostPriorityList{}, errors.NewAggregate(errs)
}
// Summarize all scores.
result := make(schedulerapi.HostPriorityList, 0, len(nodes))
for i := range nodes {
result = append(result, schedulerapi.HostPriority{Host: nodes[i].Name, Score: 0})
for j := range priorityConfigs {
result[i].Score += results[j][i].Score * priorityConfigs[j].Weight
}
}
if len(extenders) != 0 && nodes != nil {
combinedScores := make(map[string]int, len(nodeNameToInfo))
for _, extender := range extenders {
if !extender.IsInterested(pod) {
continue
}
wg.Add(1)
go func(ext algorithm.SchedulerExtender) {
defer wg.Done()
prioritizedList, weight, err := ext.Prioritize(pod, nodes)
if err != nil {
// Prioritization errors from extender can be ignored, let k8s/other extenders determine the priorities
return
}
mu.Lock()
for i := range *prioritizedList {
host, score := (*prioritizedList)[i].Host, (*prioritizedList)[i].Score
combinedScores[host] += score * weight
}
mu.Unlock()
}(extender)
}
// wait for all go routines to finish
wg.Wait()
for i := range result {
result[i].Score += combinedScores[result[i].Host]
}
}
if glog.V(10) {
for i := range result {
glog.V(10).Infof("Host %s => Score %d", result[i].Host, result[i].Score)
}
}
return result, nil
}
以下对PrioritizeNodes分段进行分析。
3. EqualPriorityMap
如果没有提供优选函数和拓展函数,则将所有的节点设置为相同的优先级,即节点的score都为1,然后直接返回结果。(但一般情况下优选函数列表都不为空)
// If no priority configs are provided, then the EqualPriority function is applied
// This is required to generate the priority list in the required format
if len(priorityConfigs) == 0 && len(extenders) == 0 {
result := make(schedulerapi.HostPriorityList, 0, len(nodes))
for i := range nodes {
hostPriority, err := EqualPriorityMap(pod, meta, nodeNameToInfo[nodes[i].Name])
if err != nil {
return nil, err
}
result = append(result, hostPriority)
}
return result, nil
}
EqualPriorityMap具体实现如下:
// EqualPriorityMap is a prioritizer function that gives an equal weight of one to all nodes
func EqualPriorityMap(_ *v1.Pod, _ interface{}, nodeInfo *schedulercache.NodeInfo) (schedulerapi.HostPriority, error) {
node := nodeInfo.Node()
if node == nil {
return schedulerapi.HostPriority{}, fmt.Errorf("node not found")
}
return schedulerapi.HostPriority{
Host: node.Name,
Score: 1,
}, nil
}
4. processNode
processNode基于index取出对应node的信息,依次调用之前注册的各个优选函数(此处为map函数)对该node和pod进行打分,并把结果写入results中对应优选函数、对应节点的位置。processNode同样使用workqueue.Parallelize来并行处理所有节点(其作用类似于预选逻辑findNodesThatFit中使用到的checkNode)。
其中优选函数是通过priorityConfigs来记录,每类优选函数包括PriorityMapFunction和PriorityReduceFunction两种函数。优选函数的注册部分可参考registerAlgorithmProvider。
processNode := func(index int) {
nodeInfo := nodeNameToInfo[nodes[index].Name]
var err error
for i := range priorityConfigs {
if priorityConfigs[i].Function != nil {
continue
}
results[i][index], err = priorityConfigs[i].Map(pod, meta, nodeInfo)
if err != nil {
appendError(err)
return
}
}
}
// 并行执行processNode
workqueue.Parallelize(16, len(nodes), processNode)
priorityConfigs定义如下:
核心属性:
- Map :PriorityMapFunction
- Reduce:PriorityReduceFunction
// PriorityConfig is a config used for a priority function.
type PriorityConfig struct {
Name string
Map PriorityMapFunction
Reduce PriorityReduceFunction
// TODO: Remove it after migrating all functions to
// Map-Reduce pattern.
Function PriorityFunction
Weight int
}
具体的优选函数处理逻辑待下文分析,本文会以NewSelectorSpreadPriority函数为例。
5. PriorityMapFunction
PriorityMapFunction针对单个节点进行计算,返回该节点的打分结果。
PriorityMapFunction定义如下:
// PriorityMapFunction is a function that computes per-node results for a given node.
// TODO: Figure out the exact API of this method.
// TODO: Change interface{} to a specific type.
type PriorityMapFunction func(pod *v1.Pod, meta interface{}, nodeInfo *schedulercache.NodeInfo) (schedulerapi.HostPriority, error)
PriorityMapFunction是在processNode中调用的,代码如下:
results[i][index], err = priorityConfigs[i].Map(pod, meta, nodeInfo)
下文会分析NewSelectorSpreadPriority中的map函数CalculateSpreadPriorityMap。
6. PriorityReduceFunction
PriorityReduceFunction是一个聚合每个节点结果并计算所有节点的最终得分的函数。
PriorityReduceFunction定义如下:
// PriorityReduceFunction is a function that aggregated per-node results and computes
// final scores for all nodes.
// TODO: Figure out the exact API of this method.
// TODO: Change interface{} to a specific type.
type PriorityReduceFunction func(pod *v1.Pod, meta interface{}, nodeNameToInfo map[string]*schedulercache.NodeInfo, result schedulerapi.HostPriorityList) error
PrioritizeNodes中对reduce函数调用部分如下:
for i, priorityConfig := range priorityConfigs {
if priorityConfig.Reduce == nil {
continue
}
wg.Add(1)
go func(index int, config algorithm.PriorityConfig) {
defer wg.Done()
if err := config.Reduce(pod, meta, nodeNameToInfo, results[index]); err != nil {
appendError(err)
}
if glog.V(10) {
for _, hostPriority := range results[index] {
glog.Infof("%v -> %v: %v, Score: (%d)", pod.Name, hostPriority.Host, config.Name, hostPriority.Score)
}
}
}(i, priorityConfig)
}
下文会分析NewSelectorSpreadPriority中的reduce函数CalculateSpreadPriorityReduce。
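为了更直观地理解Map-Reduce模式,下面给出一对假设的map/reduce函数示意:map阶段把节点上已有pod的数量作为中间分数,reduce阶段再把中间分数归一化到0-10分(pod越少得分越高)。函数名与打分逻辑均为本文虚构,并非源码中的优选函数;import路径按v1.12的源码布局,仅作示意:

```go
import (
	"fmt"

	"k8s.io/api/core/v1"
	schedulerapi "k8s.io/kubernetes/pkg/scheduler/api"
	schedulercache "k8s.io/kubernetes/pkg/scheduler/cache"
)

// examplePriorityMap 以节点上已有pod的数量作为中间分数(示意)
func examplePriorityMap(pod *v1.Pod, meta interface{}, nodeInfo *schedulercache.NodeInfo) (schedulerapi.HostPriority, error) {
	node := nodeInfo.Node()
	if node == nil {
		return schedulerapi.HostPriority{}, fmt.Errorf("node not found")
	}
	return schedulerapi.HostPriority{Host: node.Name, Score: len(nodeInfo.Pods())}, nil
}

// examplePriorityReduce 把中间分数归一化到0-10分:pod越少,得分越高(示意)
func examplePriorityReduce(pod *v1.Pod, meta interface{}, nodeNameToInfo map[string]*schedulercache.NodeInfo, result schedulerapi.HostPriorityList) error {
	maxCount := 0
	for i := range result {
		if result[i].Score > maxCount {
			maxCount = result[i].Score
		}
	}
	if maxCount == 0 {
		return nil
	}
	for i := range result {
		result[i].Score = schedulerapi.MaxPriority * (maxCount - result[i].Score) / maxCount
	}
	return nil
}
```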
7. Summarize all scores
先等待所有计算完成,再对各节点的得分做加权求和。
// Wait for all computations to be finished.
wg.Wait()
if len(errs) != 0 {
return schedulerapi.HostPriorityList{}, errors.NewAggregate(errs)
}
计算所有节点的加权总分。
// Summarize all scores.
result := make(schedulerapi.HostPriorityList, 0, len(nodes))
for i := range nodes {
result = append(result, schedulerapi.HostPriority{Host: nodes[i].Name, Score: 0})
for j := range priorityConfigs {
result[i].Score += results[j][i].Score * priorityConfigs[j].Weight
}
}
当设置了extender时,再把各extender给出的分数乘以相应权重后累加到节点总分中。
if len(extenders) != 0 && nodes != nil {
combinedScores := make(map[string]int, len(nodeNameToInfo))
for _, extender := range extenders {
if !extender.IsInterested(pod) {
continue
}
wg.Add(1)
go func(ext algorithm.SchedulerExtender) {
defer wg.Done()
prioritizedList, weight, err := ext.Prioritize(pod, nodes)
if err != nil {
// Prioritization errors from extender can be ignored, let k8s/other extenders determine the priorities
return
}
mu.Lock()
for i := range *prioritizedList {
host, score := (*prioritizedList)[i].Host, (*prioritizedList)[i].Score
combinedScores[host] += score * weight
}
mu.Unlock()
}(extender)
}
// wait for all go routines to finish
wg.Wait()
for i := range result {
result[i].Score += combinedScores[result[i].Host]
}
}
8. NewSelectorSpreadPriority
以下以NewSelectorSpreadPriority这个优选函数来做分析,其他重要的优选函数待后续专门分析。
NewSelectorSpreadPriority主要的功能是将属于相同service、RC、RS或StatefulSet的pod尽量分散到不同的node上。
该函数的注册代码如下:
此部分代码位于pkg/scheduler/algorithmprovider/defaults/defaults.go
// ServiceSpreadingPriority is a priority config factory that spreads pods by minimizing
// the number of pods (belonging to the same service) on the same node.
// Register the factory so that it's available, but do not include it as part of the default priorities
// Largely replaced by "SelectorSpreadPriority", but registered for backward compatibility with 1.0
factory.RegisterPriorityConfigFactory(
"ServiceSpreadingPriority",
factory.PriorityConfigFactory{
MapReduceFunction: func(args factory.PluginFactoryArgs) (algorithm.PriorityMapFunction, algorithm.PriorityReduceFunction) {
return priorities.NewSelectorSpreadPriority(args.ServiceLister, algorithm.EmptyControllerLister{}, algorithm.EmptyReplicaSetLister{}, algorithm.EmptyStatefulSetLister{})
},
Weight: 1,
},
)
NewSelectorSpreadPriority的具体实现如下:
此部分代码位于pkg/scheduler/algorithm/priorities/selector_spreading.go
// NewSelectorSpreadPriority creates a SelectorSpread.
func NewSelectorSpreadPriority(
serviceLister algorithm.ServiceLister,
controllerLister algorithm.ControllerLister,
replicaSetLister algorithm.ReplicaSetLister,
statefulSetLister algorithm.StatefulSetLister) (algorithm.PriorityMapFunction, algorithm.PriorityReduceFunction) {
selectorSpread := &SelectorSpread{
serviceLister: serviceLister,
controllerLister: controllerLister,
replicaSetLister: replicaSetLister,
statefulSetLister: statefulSetLister,
}
return selectorSpread.CalculateSpreadPriorityMap, selectorSpread.CalculateSpreadPriorityReduce
}
NewSelectorSpreadPriority主要包括map和reduce两种函数,分别对应CalculateSpreadPriorityMap,CalculateSpreadPriorityReduce。
8.1. CalculateSpreadPriorityMap
CalculateSpreadPriorityMap的主要作用是将相同service、RC、RS或statefulset的pod分布在不同的节点上。当调度一个pod的时候,先寻找与该pod匹配的service、RS、RC或statefulset,然后寻找与其selector匹配的已存在的pod,寻找存在这类pod最少的节点。
基本流程如下:
- 寻找与该pod对应的service、RS、RC、statefulset匹配的selector。
- 遍历当前节点的所有pod,将该节点上已存在的selector匹配到的pod的个数作为该节点的分数(此时,分数大的表示匹配到的pod越多,越不符合被调度的条件,该分数在reduce阶段会被按10分制处理成分数大的越符合被调度的条件)。
此部分代码位于pkg/scheduler/algorithm/priorities/selector_spreading.go
// CalculateSpreadPriorityMap spreads pods across hosts, considering pods
// belonging to the same service,RC,RS or StatefulSet.
// When a pod is scheduled, it looks for services, RCs,RSs and StatefulSets that match the pod,
// then finds existing pods that match those selectors.
// It favors nodes that have fewer existing matching pods.
// i.e. it pushes the scheduler towards a node where there's the smallest number of
// pods which match the same service, RC,RSs or StatefulSets selectors as the pod being scheduled.
func (s *SelectorSpread) CalculateSpreadPriorityMap(pod *v1.Pod, meta interface{}, nodeInfo *schedulercache.NodeInfo) (schedulerapi.HostPriority, error) {
var selectors []labels.Selector
node := nodeInfo.Node()
if node == nil {
return schedulerapi.HostPriority{}, fmt.Errorf("node not found")
}
priorityMeta, ok := meta.(*priorityMetadata)
if ok {
selectors = priorityMeta.podSelectors
} else {
selectors = getSelectors(pod, s.serviceLister, s.controllerLister, s.replicaSetLister, s.statefulSetLister)
}
if len(selectors) == 0 {
return schedulerapi.HostPriority{
Host: node.Name,
Score: int(0),
}, nil
}
count := int(0)
for _, nodePod := range nodeInfo.Pods() {
if pod.Namespace != nodePod.Namespace {
continue
}
// When we are replacing a failed pod, we often see the previous
// deleted version while scheduling the replacement.
// Ignore the previous deleted version for spreading purposes
// (it can still be considered for resource restrictions etc.)
if nodePod.DeletionTimestamp != nil {
glog.V(4).Infof("skipping pending-deleted pod: %s/%s", nodePod.Namespace, nodePod.Name)
continue
}
for _, selector := range selectors {
if selector.Matches(labels.Set(nodePod.ObjectMeta.Labels)) {
count++
break
}
}
}
return schedulerapi.HostPriority{
Host: node.Name,
Score: int(count),
}, nil
}
以下分段分析:
先获得selector。
selectors = getSelectors(pod, s.serviceLister, s.controllerLister, s.replicaSetLister, s.statefulSetLister)
计算节点上匹配selector的pod的个数,作为该节点分数,该分数并不是最终节点的分数,只是中间过渡的记录状态。
count := int(0)
for _, nodePod := range nodeInfo.Pods() {
...
for _, selector := range selectors {
if selector.Matches(labels.Set(nodePod.ObjectMeta.Labels)) {
count++
break
}
}
}
8.2. CalculateSpreadPriorityReduce
CalculateSpreadPriorityReduce根据各节点上现有匹配pod的数量,计算每个节点十分制的最终分数:现有匹配pod越少的节点分数越高,表示该节点越可能被调度到。
基本流程如下:
- 记录所有节点中匹配到pod个数最多的节点的分数(即匹配到的pod最多的个数)。
- 遍历所有的节点,按比例取十分制的得分,计算方式为:10 ×(所有节点中最多匹配pod的个数 - 当前节点匹配到的pod个数)/ 所有节点中最多匹配pod的个数;若节点带有zone信息,还会按zoneWeighting与zone维度的得分做加权混合。此时,分数越高表示该节点上匹配到的pod的个数越少,越可能被调度到,即满足把相同selector的pod分散到不同节点的需求(见下面的数字例子)。
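举一个数字例子(数值均为假设):若所有节点中匹配到的pod最多为4个,当前节点匹配到1个,则该节点得分约为10*(4-1)/4≈7分;匹配到4个的节点得0分。下面的小函数演示这一换算(忽略zone加权,nodeScore为本文虚构的演示函数):

```go
// nodeScore 按十分制换算节点得分:匹配到的pod越少,得分越高(忽略zone加权,示意)
func nodeScore(maxCount, count int) int {
	if maxCount == 0 {
		// 所有节点都没有匹配的pod时,全部给满分
		return 10
	}
	return 10 * (maxCount - count) / maxCount
}
```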
此部分代码位于pkg/scheduler/algorithm/priorities/selector_spreading.go
// CalculateSpreadPriorityReduce calculates the source of each node
// based on the number of existing matching pods on the node
// where zone information is included on the nodes, it favors nodes
// in zones with fewer existing matching pods.
func (s *SelectorSpread) CalculateSpreadPriorityReduce(pod *v1.Pod, meta interface{}, nodeNameToInfo map[string]*schedulercache.NodeInfo, result schedulerapi.HostPriorityList) error {
countsByZone := make(map[string]int, 10)
maxCountByZone := int(0)
maxCountByNodeName := int(0)
for i := range result {
if result[i].Score > maxCountByNodeName {
maxCountByNodeName = result[i].Score
}
zoneID := utilnode.GetZoneKey(nodeNameToInfo[result[i].Host].Node())
if zoneID == "" {
continue
}
countsByZone[zoneID] += result[i].Score
}
for zoneID := range countsByZone {
if countsByZone[zoneID] > maxCountByZone {
maxCountByZone = countsByZone[zoneID]
}
}
haveZones := len(countsByZone) != 0
maxCountByNodeNameFloat64 := float64(maxCountByNodeName)
maxCountByZoneFloat64 := float64(maxCountByZone)
MaxPriorityFloat64 := float64(schedulerapi.MaxPriority)
for i := range result {
// initializing to the default/max node score of maxPriority
fScore := MaxPriorityFloat64
if maxCountByNodeName > 0 {
fScore = MaxPriorityFloat64 * (float64(maxCountByNodeName-result[i].Score) / maxCountByNodeNameFloat64)
}
// If there is zone information present, incorporate it
if haveZones {
zoneID := utilnode.GetZoneKey(nodeNameToInfo[result[i].Host].Node())
if zoneID != "" {
zoneScore := MaxPriorityFloat64
if maxCountByZone > 0 {
zoneScore = MaxPriorityFloat64 * (float64(maxCountByZone-countsByZone[zoneID]) / maxCountByZoneFloat64)
}
fScore = (fScore * (1.0 - zoneWeighting)) + (zoneWeighting * zoneScore)
}
}
result[i].Score = int(fScore)
if glog.V(10) {
glog.Infof(
"%v -> %v: SelectorSpreadPriority, Score: (%d)", pod.Name, result[i].Host, int(fScore),
)
}
}
return nil
}
以下分段分析:
先获取所有节点中匹配到的pod最多的个数。
for i := range result {
if result[i].Score > maxCountByNodeName {
maxCountByNodeName = result[i].Score
}
zoneID := utilnode.GetZoneKey(nodeNameToInfo[result[i].Host].Node())
if zoneID == "" {
continue
}
countsByZone[zoneID] += result[i].Score
}
遍历所有的节点,按比例取十分制的得分。
for i := range result {
// initializing to the default/max node score of maxPriority
fScore := MaxPriorityFloat64
if maxCountByNodeName > 0 {
fScore = MaxPriorityFloat64 * (float64(maxCountByNodeName-result[i].Score) / maxCountByNodeNameFloat64)
}
...
}
9. 总结
优选,即从通过预选的节点中选择出最优的节点。PrioritizeNodes最终返回的是一个记录了各个节点分数的列表。
9.1. PrioritizeNodes
主要流程如下:
- 如果没有设置优选函数和拓展函数,则全部节点设置相同的分数,直接返回。
- 依次给node执行map函数进行打分。
- 再对上述map函数的执行结果执行reduce函数计算最终得分。
- 最后按各优选函数的权重对得分做加权求和。
其中每类优选函数会包含map函数和reduce函数两种。
9.2. NewSelectorSpreadPriority
其中以NewSelectorSpreadPriority这个优选函数为例作分析,该函数的功能是将相同service、RS、RC或statefulset下的pod尽量分散到不同的节点上,包括map函数和reduce函数两部分,具体如下。
9.2.1. CalculateSpreadPriorityMap
基本流程如下:
- 寻找与该pod对应的service、RS、RC、statefulset匹配的selector。
- 遍历当前节点的所有pod,将该节点上已存在的selector匹配到的pod的个数作为该节点的分数(此时,分数大的表示匹配到的pod越多,越不符合被调度的条件,该分数在reduce阶段会被按10分制处理成分数大的越符合被调度的条件)。
9.2.2. CalculateSpreadPriorityReduce
基本流程如下:
- 记录所有节点中匹配到pod个数最多的节点的分数(即匹配到的pod最多的个数)。
- 遍历所有的节点,按比例取十分制的得分,计算方式为:10 ×(节点中最多匹配pod的个数 - 当前节点匹配到的pod个数)/ 节点中最多匹配pod的个数。此时,分数越高表示该节点上匹配到的pod的个数越少,越可能被调度到,即满足把相同selector的pod分散到不同节点的需求。
参考:
11.4.5 - kube-scheduler源码分析(二)之 registerAlgorithmProvider
以下代码分析基于kubernetes v1.12.0版本。
此部分主要介绍调度中使用的各种调度算法,包括调度算法的注册部分。注册部分的代码主要在/pkg/scheduler/algorithmprovider中,具体的预选策略和优选策略的算法实现在/pkg/scheduler/algorithm中。
1. ApplyFeatureGates
注册调度算法的调用入口在SchedulerCommand的Run函数中。
此部分代码位于/cmd/kube-scheduler/app/server.go
// Run runs the Scheduler.
func Run(c schedulerserverconfig.CompletedConfig, stopCh <-chan struct{}) error {
...
// Apply algorithms based on feature gates.
// TODO: make configurable?
algorithmprovider.ApplyFeatureGates()
...
}
ApplyFeatureGates的具体实现在pkg/scheduler/algorithmprovider包中。
此部分代码位于/pkg/scheduler/algorithmprovider/plugins.go
// ApplyFeatureGates applies algorithm by feature gates.
func ApplyFeatureGates() {
defaults.ApplyFeatureGates()
}
ApplyFeatureGates具体实现如下:
此部分代码位于/pkg/scheduler/algorithmprovider/defaults/defaults.go
根据feature gate的开启情况增删部分调度策略。
// ApplyFeatureGates applies algorithm by feature gates.
func ApplyFeatureGates() {
if utilfeature.DefaultFeatureGate.Enabled(features.TaintNodesByCondition) {
// Remove "CheckNodeCondition", "CheckNodeMemoryPressure", "CheckNodePIDPressurePred"
// and "CheckNodeDiskPressure" predicates
factory.RemoveFitPredicate(predicates.CheckNodeConditionPred)
factory.RemoveFitPredicate(predicates.CheckNodeMemoryPressurePred)
factory.RemoveFitPredicate(predicates.CheckNodeDiskPressurePred)
factory.RemoveFitPredicate(predicates.CheckNodePIDPressurePred)
// Remove key "CheckNodeCondition", "CheckNodeMemoryPressure" and "CheckNodeDiskPressure"
// from ALL algorithm provider
// The key will be removed from all providers which in algorithmProviderMap[]
// if you just want remove specific provider, call func RemovePredicateKeyFromAlgoProvider()
factory.RemovePredicateKeyFromAlgorithmProviderMap(predicates.CheckNodeConditionPred)
factory.RemovePredicateKeyFromAlgorithmProviderMap(predicates.CheckNodeMemoryPressurePred)
factory.RemovePredicateKeyFromAlgorithmProviderMap(predicates.CheckNodeDiskPressurePred)
factory.RemovePredicateKeyFromAlgorithmProviderMap(predicates.CheckNodePIDPressurePred)
// Fit is determined based on whether a pod can tolerate all of the node's taints
factory.RegisterMandatoryFitPredicate(predicates.PodToleratesNodeTaintsPred, predicates.PodToleratesNodeTaints)
// Fit is determined based on whether a pod can tolerate unschedulable of node
factory.RegisterMandatoryFitPredicate(predicates.CheckNodeUnschedulablePred, predicates.CheckNodeUnschedulablePredicate)
// Insert Key "PodToleratesNodeTaints" and "CheckNodeUnschedulable" To All Algorithm Provider
// The key will insert to all providers which in algorithmProviderMap[]
// if you just want insert to specific provider, call func InsertPredicateKeyToAlgoProvider()
factory.InsertPredicateKeyToAlgorithmProviderMap(predicates.PodToleratesNodeTaintsPred)
factory.InsertPredicateKeyToAlgorithmProviderMap(predicates.CheckNodeUnschedulablePred)
glog.Warningf("TaintNodesByCondition is enabled, PodToleratesNodeTaints predicate is mandatory")
}
// Prioritizes nodes that satisfy pod's resource limits
if utilfeature.DefaultFeatureGate.Enabled(features.ResourceLimitsPriorityFunction) {
factory.RegisterPriorityFunction2("ResourceLimitsPriority", priorities.ResourceLimitsPriorityMap, nil, 1)
}
}
2. init
当algorithmprovider包被导入时,会自动执行init初始化函数,此部分主要完成预选算法和优选算法的注册。
此部分代码位于/pkg/scheduler/algorithmprovider/defaults/defaults.go
func init() {
// Register functions that extract metadata used by predicates and priorities computations.
factory.RegisterPredicateMetadataProducerFactory(
func(args factory.PluginFactoryArgs) algorithm.PredicateMetadataProducer {
return predicates.NewPredicateMetadataFactory(args.PodLister)
})
factory.RegisterPriorityMetadataProducerFactory(
func(args factory.PluginFactoryArgs) algorithm.PriorityMetadataProducer {
return priorities.NewPriorityMetadataFactory(args.ServiceLister, args.ControllerLister, args.ReplicaSetLister, args.StatefulSetLister)
})
registerAlgorithmProvider(defaultPredicates(), defaultPriorities())
// IMPORTANT NOTES for predicate developers:
// We are using cached predicate result for pods belonging to the same equivalence class.
// So when implementing a new predicate, you are expected to check whether the result
// of your predicate function can be affected by related API object change (ADD/DELETE/UPDATE).
// If yes, you are expected to invalidate the cached predicate result for related API object change.
// For example:
// https://github.com/kubernetes/kubernetes/blob/36a218e/plugin/pkg/scheduler/factory/factory.go#L422
// Registers predicates and priorities that are not enabled by default, but user can pick when creating their
// own set of priorities/predicates.
// PodFitsPorts has been replaced by PodFitsHostPorts for better user understanding.
// For backwards compatibility with 1.0, PodFitsPorts is registered as well.
factory.RegisterFitPredicate("PodFitsPorts", predicates.PodFitsHostPorts)
// Fit is defined based on the absence of port conflicts.
// This predicate is actually a default predicate, because it is invoked from
// predicates.GeneralPredicates()
factory.RegisterFitPredicate(predicates.PodFitsHostPortsPred, predicates.PodFitsHostPorts)
// Fit is determined by resource availability.
// This predicate is actually a default predicate, because it is invoked from
// predicates.GeneralPredicates()
factory.RegisterFitPredicate(predicates.PodFitsResourcesPred, predicates.PodFitsResources)
// Fit is determined by the presence of the Host parameter and a string match
// This predicate is actually a default predicate, because it is invoked from
// predicates.GeneralPredicates()
factory.RegisterFitPredicate(predicates.HostNamePred, predicates.PodFitsHost)
// Fit is determined by node selector query.
factory.RegisterFitPredicate(predicates.MatchNodeSelectorPred, predicates.PodMatchNodeSelector)
// ServiceSpreadingPriority is a priority config factory that spreads pods by minimizing
// the number of pods (belonging to the same service) on the same node.
// Register the factory so that it's available, but do not include it as part of the default priorities
// Largely replaced by "SelectorSpreadPriority", but registered for backward compatibility with 1.0
factory.RegisterPriorityConfigFactory(
"ServiceSpreadingPriority",
factory.PriorityConfigFactory{
MapReduceFunction: func(args factory.PluginFactoryArgs) (algorithm.PriorityMapFunction, algorithm.PriorityReduceFunction) {
return priorities.NewSelectorSpreadPriority(args.ServiceLister, algorithm.EmptyControllerLister{}, algorithm.EmptyReplicaSetLister{}, algorithm.EmptyStatefulSetLister{})
},
Weight: 1,
},
)
// EqualPriority is a prioritizer function that gives an equal weight of one to all nodes
// Register the priority function so that its available
// but do not include it as part of the default priorities
factory.RegisterPriorityFunction2("EqualPriority", core.EqualPriorityMap, nil, 1)
// Optional, cluster-autoscaler friendly priority function - give used nodes higher priority.
factory.RegisterPriorityFunction2("MostRequestedPriority", priorities.MostRequestedPriorityMap, nil, 1)
factory.RegisterPriorityFunction2(
"RequestedToCapacityRatioPriority",
priorities.RequestedToCapacityRatioResourceAllocationPriorityDefault().PriorityMap,
nil,
1)
}
以下对init中的注册进行拆分介绍。
2.1. registerAlgorithmProvider
此部分主要注册默认的预选和优选策略。
// Register functions that extract metadata used by predicates and priorities computations.
factory.RegisterPredicateMetadataProducerFactory(
func(args factory.PluginFactoryArgs) algorithm.PredicateMetadataProducer {
return predicates.NewPredicateMetadataFactory(args.PodLister)
})
factory.RegisterPriorityMetadataProducerFactory(
func(args factory.PluginFactoryArgs) algorithm.PriorityMetadataProducer {
return priorities.NewPriorityMetadataFactory(args.ServiceLister, args.ControllerLister, args.ReplicaSetLister, args.StatefulSetLister)
})
registerAlgorithmProvider(defaultPredicates(), defaultPriorities())
registerAlgorithmProvider注册AlgorithmProvider,其中包括DefaultProvider和ClusterAutoscalerProvider。
func registerAlgorithmProvider(predSet, priSet sets.String) {
// Registers algorithm providers. By default we use 'DefaultProvider', but user can specify one to be used
// by specifying flag.
factory.RegisterAlgorithmProvider(factory.DefaultProvider, predSet, priSet)
// Cluster autoscaler friendly scheduling algorithm.
factory.RegisterAlgorithmProvider(ClusterAutoscalerProvider, predSet,
copyAndReplace(priSet, "LeastRequestedPriority", "MostRequestedPriority"))
}
2.2. RegisterFitPredicate
在init部分注册预选策略函数。
预选策略如下:
| 调度策略 | 函数 | 描述 |
|---|---|---|
| PodFitsPorts | PodFitsHostPorts | PodFitsPorts已经被PodFitsHostPorts代替,此处主要是为了兼容性。 |
| PodFitsHostPortsPred | PodFitsHostPorts | 判断是否与宿主机的端口冲突。 |
| PodFitsResourcesPred | PodFitsResources | 判断node资源是否充足。 |
| HostNamePred | PodFitsHost | 判断pod所指定调度的节点是否是当前的节点。 |
| MatchNodeSelectorPred | PodMatchNodeSelector | 判断pod指定的node selector是否匹配当前的node。 |
具体代码如下:
// PodFitsPorts has been replaced by PodFitsHostPorts for better user understanding.
// For backwards compatibility with 1.0, PodFitsPorts is registered as well.
factory.RegisterFitPredicate("PodFitsPorts", predicates.PodFitsHostPorts)
// Fit is defined based on the absence of port conflicts.
// This predicate is actually a default predicate, because it is invoked from
// predicates.GeneralPredicates()
factory.RegisterFitPredicate(predicates.PodFitsHostPortsPred, predicates.PodFitsHostPorts)
// Fit is determined by resource availability.
// This predicate is actually a default predicate, because it is invoked from
// predicates.GeneralPredicates()
factory.RegisterFitPredicate(predicates.PodFitsResourcesPred, predicates.PodFitsResources)
// Fit is determined by the presence of the Host parameter and a string match
// This predicate is actually a default predicate, because it is invoked from
// predicates.GeneralPredicates()
factory.RegisterFitPredicate(predicates.HostNamePred, predicates.PodFitsHost)
// Fit is determined by node selector query.
factory.RegisterFitPredicate(predicates.MatchNodeSelectorPred, predicates.PodMatchNodeSelector)
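预选策略函数的统一签名为algorithm.FitPredicate:输入pod、预选元数据和节点信息,返回该节点是否适合以及失败原因。基于同样的注册机制,理论上也可以注册自定义的预选策略。下面给出一个简化示意,其中策略名NoopPred和函数noopPredicate均为示例假设,并非源码中的内容:
// noopPredicate 是一个示例预选函数,永远返回true,仅用于说明FitPredicate的签名。
// v1.12中algorithm.FitPredicate的签名大致为:
//   func(pod *v1.Pod, meta algorithm.PredicateMetadata, nodeInfo *schedulercache.NodeInfo) (bool, []algorithm.PredicateFailureReason, error)
func noopPredicate(pod *v1.Pod, meta algorithm.PredicateMetadata, nodeInfo *schedulercache.NodeInfo) (bool, []algorithm.PredicateFailureReason, error) {
	// 这里可以加入自定义的过滤逻辑,返回false时需要同时给出失败原因。
	return true, nil, nil
}
// 在init()或defaultPredicates()中按同样方式注册(NoopPred为示例名称):
factory.RegisterFitPredicate("NoopPred", noopPredicate)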
2.3. RegisterPriorityFunction2
在init部分注册优选策略函数。
// EqualPriority is a prioritizer function that gives an equal weight of one to all nodes
// Register the priority function so that its available
// but do not include it as part of the default priorities
factory.RegisterPriorityFunction2("EqualPriority", core.EqualPriorityMap, nil, 1)
// Optional, cluster-autoscaler friendly priority function - give used nodes higher priority.
factory.RegisterPriorityFunction2("MostRequestedPriority", priorities.MostRequestedPriorityMap, nil, 1)
factory.RegisterPriorityFunction2(
"RequestedToCapacityRatioPriority",
priorities.RequestedToCapacityRatioResourceAllocationPriorityDefault().PriorityMap,
nil,
1)
3. defaultPredicates
此部分为默认预选策略的注册函数。
默认的预选策略如下:
| 预选策略 | 函数 | 描述 |
|---|---|---|
| NoVolumeZoneConflictPred | NewVolumeZonePredicate | 判断pod使用到的volume是否有节点的要求。目前只支持pvc。 |
| MaxEBSVolumeCountPred | NewMaxPDVolumeCountPredicate | 判断pod使用EBSVolume在该节点上是否已经达到上限了。 |
| MaxGCEPDVolumeCountPred | NewMaxPDVolumeCountPredicate | 判断pod使用GCEPDVolume在该节点上是否已经达到上限了。 |
| MaxAzureDiskVolumeCountPred | NewMaxPDVolumeCountPredicate | 判断pod使用AzureDiskVolume在该节点上是否已经达到上限了。 |
| MaxCSIVolumeCountPred | NewCSIMaxVolumeLimitPredicate | 判断CSIVolume是否达到上限了。 |
| MatchInterPodAffinityPred | NewPodAffinityPredicate | 匹配pod的亲缘性。 |
| NoDiskConflictPred | NoDiskConflict | 判断是否有disk volumes的冲突。 |
| GeneralPred | GeneralPredicates | 通用的预选策略 |
| CheckNodeMemoryPressurePred | CheckNodeMemoryPressurePredicate | 判断节点内存是否充足。 |
| CheckNodeDiskPressurePred | CheckNodeDiskPressurePredicate | 判断节点是否有磁盘压力。 |
| CheckNodePIDPressurePred | CheckNodePIDPressurePredicate | 判断节点是否有PID压力。 |
| CheckNodeConditionPred | CheckNodeConditionPredicate | 判断node的condition,包括not ready、网络不可用、磁盘不足等。 |
| PodToleratesNodeTaintsPred | PodToleratesNodeTaints | 判断pod是否可以容忍节点的taints。 |
| CheckVolumeBindingPred | NewVolumeBindingPredicate | 判断是否有volume拓扑的要求。 |
具体代码如下:
func defaultPredicates() sets.String {
return sets.NewString(
// Fit is determined by volume zone requirements.
factory.RegisterFitPredicateFactory(
predicates.NoVolumeZoneConflictPred,
func(args factory.PluginFactoryArgs) algorithm.FitPredicate {
return predicates.NewVolumeZonePredicate(args.PVInfo, args.PVCInfo, args.StorageClassInfo)
},
),
// Fit is determined by whether or not there would be too many AWS EBS volumes attached to the node
factory.RegisterFitPredicateFactory(
predicates.MaxEBSVolumeCountPred,
func(args factory.PluginFactoryArgs) algorithm.FitPredicate {
return predicates.NewMaxPDVolumeCountPredicate(predicates.EBSVolumeFilterType, args.PVInfo, args.PVCInfo)
},
),
// Fit is determined by whether or not there would be too many GCE PD volumes attached to the node
factory.RegisterFitPredicateFactory(
predicates.MaxGCEPDVolumeCountPred,
func(args factory.PluginFactoryArgs) algorithm.FitPredicate {
return predicates.NewMaxPDVolumeCountPredicate(predicates.GCEPDVolumeFilterType, args.PVInfo, args.PVCInfo)
},
),
// Fit is determined by whether or not there would be too many Azure Disk volumes attached to the node
factory.RegisterFitPredicateFactory(
predicates.MaxAzureDiskVolumeCountPred,
func(args factory.PluginFactoryArgs) algorithm.FitPredicate {
return predicates.NewMaxPDVolumeCountPredicate(predicates.AzureDiskVolumeFilterType, args.PVInfo, args.PVCInfo)
},
),
factory.RegisterFitPredicateFactory(
predicates.MaxCSIVolumeCountPred,
func(args factory.PluginFactoryArgs) algorithm.FitPredicate {
return predicates.NewCSIMaxVolumeLimitPredicate(args.PVInfo, args.PVCInfo)
},
),
// Fit is determined by inter-pod affinity.
factory.RegisterFitPredicateFactory(
predicates.MatchInterPodAffinityPred,
func(args factory.PluginFactoryArgs) algorithm.FitPredicate {
return predicates.NewPodAffinityPredicate(args.NodeInfo, args.PodLister)
},
),
// Fit is determined by non-conflicting disk volumes.
factory.RegisterFitPredicate(predicates.NoDiskConflictPred, predicates.NoDiskConflict),
// GeneralPredicates are the predicates that are enforced by all Kubernetes components
// (e.g. kubelet and all schedulers)
factory.RegisterFitPredicate(predicates.GeneralPred, predicates.GeneralPredicates),
// Fit is determined by node memory pressure condition.
factory.RegisterFitPredicate(predicates.CheckNodeMemoryPressurePred, predicates.CheckNodeMemoryPressurePredicate),
// Fit is determined by node disk pressure condition.
factory.RegisterFitPredicate(predicates.CheckNodeDiskPressurePred, predicates.CheckNodeDiskPressurePredicate),
// Fit is determined by node pid pressure condition.
factory.RegisterFitPredicate(predicates.CheckNodePIDPressurePred, predicates.CheckNodePIDPressurePredicate),
// Fit is determined by node conditions: not ready, network unavailable or out of disk.
factory.RegisterMandatoryFitPredicate(predicates.CheckNodeConditionPred, predicates.CheckNodeConditionPredicate),
// Fit is determined based on whether a pod can tolerate all of the node's taints
factory.RegisterFitPredicate(predicates.PodToleratesNodeTaintsPred, predicates.PodToleratesNodeTaints),
// Fit is determined by volume topology requirements.
factory.RegisterFitPredicateFactory(
predicates.CheckVolumeBindingPred,
func(args factory.PluginFactoryArgs) algorithm.FitPredicate {
return predicates.NewVolumeBindingPredicate(args.VolumeBinder)
},
),
)
}
4. defaultPriorities
此部分主要为默认优选策略的注册函数。
默认优选策略如下:
| 优选策略 | 函数 | 描述 |
|---|---|---|
| SelectorSpreadPriority | NewSelectorSpreadPriority | 属于相同service和rs下的pod尽量分布在不同的node上。 |
| InterPodAffinityPriority | NewInterPodAffinityPriority | 根据pod间的亲和性与反亲和性,将pod调度到(或避开)与指定pod相同的拓扑域。 |
| LeastRequestedPriority | LeastRequestedPriorityMap | 按最少请求的利用率对节点进行优先级排序。 |
| BalancedResourceAllocation | BalancedResourceAllocationMap | 实现资源的平衡使用。 |
| NodePreferAvoidPodsPriority | CalculateNodePreferAvoidPodsPriorityMap | 根据节点是否标注了应避免调度(preferAvoidPods)打分,该策略的权重(10000)足以覆盖其他优先级函数。 |
| NodeAffinityPriority | CalculateNodeAffinityPriorityMap | 根据node的label与pod的NodeAffinity的匹配情况打分。 |
| TaintTolerationPriority | ComputeTaintTolerationPriorityMap | 根据pod的tolerations对node上taint的容忍情况打分。 |
| ImageLocalityPriority | ImageLocalityPriorityMap | 根据节点上是否有该pod使用到的镜像打分。 |
具体代码实现如下:
func defaultPriorities() sets.String {
return sets.NewString(
// spreads pods by minimizing the number of pods (belonging to the same service or replication controller) on the same node.
factory.RegisterPriorityConfigFactory(
"SelectorSpreadPriority",
factory.PriorityConfigFactory{
MapReduceFunction: func(args factory.PluginFactoryArgs) (algorithm.PriorityMapFunction, algorithm.PriorityReduceFunction) {
return priorities.NewSelectorSpreadPriority(args.ServiceLister, args.ControllerLister, args.ReplicaSetLister, args.StatefulSetLister)
},
Weight: 1,
},
),
// pods should be placed in the same topological domain (e.g. same node, same rack, same zone, same power domain, etc.)
// as some other pods, or, conversely, should not be placed in the same topological domain as some other pods.
factory.RegisterPriorityConfigFactory(
"InterPodAffinityPriority",
factory.PriorityConfigFactory{
Function: func(args factory.PluginFactoryArgs) algorithm.PriorityFunction {
return priorities.NewInterPodAffinityPriority(args.NodeInfo, args.NodeLister, args.PodLister, args.HardPodAffinitySymmetricWeight)
},
Weight: 1,
},
),
// Prioritize nodes by least requested utilization.
factory.RegisterPriorityFunction2("LeastRequestedPriority", priorities.LeastRequestedPriorityMap, nil, 1),
// Prioritizes nodes to help achieve balanced resource usage
factory.RegisterPriorityFunction2("BalancedResourceAllocation", priorities.BalancedResourceAllocationMap, nil, 1),
// Set this weight large enough to override all other priority functions.
// TODO: Figure out a better way to do this, maybe at same time as fixing #24720.
factory.RegisterPriorityFunction2("NodePreferAvoidPodsPriority", priorities.CalculateNodePreferAvoidPodsPriorityMap, nil, 10000),
// Prioritizes nodes that have labels matching NodeAffinity
factory.RegisterPriorityFunction2("NodeAffinityPriority", priorities.CalculateNodeAffinityPriorityMap, priorities.CalculateNodeAffinityPriorityReduce, 1),
// Prioritizes nodes that marked with taint which pod can tolerate.
factory.RegisterPriorityFunction2("TaintTolerationPriority", priorities.ComputeTaintTolerationPriorityMap, priorities.ComputeTaintTolerationPriorityReduce, 1),
// ImageLocalityPriority prioritizes nodes that have images requested by the pod present.
factory.RegisterPriorityFunction2("ImageLocalityPriority", priorities.ImageLocalityPriorityMap, nil, 1),
)
}
参考:
11.4.6 - kube-scheduler源码分析(三)之 scheduleOne
以下代码分析基于kubernetes v1.12.0版本。
本文主要分析/pkg/scheduler/中调度的基本流程。具体的预选调度逻辑、优选调度逻辑、节点抢占逻辑待后续再独立分析。
scheduler的pkg代码目录结构如下:
scheduler
├── algorithm # 主要包含调度的算法
│ ├── predicates # 预选的策略
│ ├── priorities # 优选的策略
│ ├── scheduler_interface.go # ScheduleAlgorithm、SchedulerExtender接口定义
│ ├── types.go # 使用到的type的定义
├── algorithmprovider
│ ├── defaults
│ │ ├── defaults.go # 默认算法的初始化操作,包括预选和优选策略
├── cache # scheduler调度使用到的cache
│ ├── cache.go # schedulerCache
│ ├── interface.go
│ ├── node_info.go
│ ├── node_tree.go
├── core # 调度逻辑的核心代码
│ ├── equivalence
│ │ ├── eqivalence.go # 存储相同pod的调度结果缓存,主要给预选策略使用
│ ├── extender.go
│ ├── generic_scheduler.go # genericScheduler,主要包含默认调度器的调度逻辑
│ ├── scheduling_queue.go # 调度使用到的队列,主要用来存储需要被调度的pod
├── factory
│ ├── factory.go # 主要包括NewConfigFactory、NewPodInformer,监听pod事件来更新调度队列
├── metrics
│ └── metrics.go # 主要给prometheus使用
├── scheduler.go # pkg部分的Run入口(核心代码),主要包含Run、scheduleOne、schedule、preempt等函数
└── volumebinder
└── volume_binder.go # volume bind
1. Scheduler.Run
此部分代码位于pkg/scheduler/scheduler.go
此处为具体调度逻辑的入口。
// Run begins watching and scheduling. It waits for cache to be synced, then starts a goroutine and returns immediately.
func (sched *Scheduler) Run() {
if !sched.config.WaitForCacheSync() {
return
}
go wait.Until(sched.scheduleOne, 0, sched.config.StopEverything)
}
2. Scheduler.scheduleOne
此部分代码位于pkg/scheduler/scheduler.go
scheduleOne主要为单个pod选择一个适合的节点,为调度逻辑的核心函数。
对单个pod进行调度的基本流程如下:
- 从podQueue待调度队列中弹出需要调度的pod。
- 通过具体的调度算法为该pod选出合适的节点,其中调度算法就包括预选和优选两步策略。
- 如果上述调度失败,则会尝试抢占机制,将优先级低的pod剔除,让优先级高的pod调度成功。
- 将该pod和选定的节点进行假性绑定,存入scheduler cache中,方便具体绑定操作可以异步进行。
- 实际执行绑定操作,将node的名字添加到pod的节点相关属性中。
完整代码如下:
// scheduleOne does the entire scheduling workflow for a single pod. It is serialized on the scheduling algorithm's host fitting.
func (sched *Scheduler) scheduleOne() {
pod := sched.config.NextPod()
if pod.DeletionTimestamp != nil {
sched.config.Recorder.Eventf(pod, v1.EventTypeWarning, "FailedScheduling", "skip schedule deleting pod: %v/%v", pod.Namespace, pod.Name)
glog.V(3).Infof("Skip schedule deleting pod: %v/%v", pod.Namespace, pod.Name)
return
}
glog.V(3).Infof("Attempting to schedule pod: %v/%v", pod.Namespace, pod.Name)
// Synchronously attempt to find a fit for the pod.
start := time.Now()
suggestedHost, err := sched.schedule(pod)
if err != nil {
// schedule() may have failed because the pod would not fit on any host, so we try to
// preempt, with the expectation that the next time the pod is tried for scheduling it
// will fit due to the preemption. It is also possible that a different pod will schedule
// into the resources that were preempted, but this is harmless.
if fitError, ok := err.(*core.FitError); ok {
preemptionStartTime := time.Now()
sched.preempt(pod, fitError)
metrics.PreemptionAttempts.Inc()
metrics.SchedulingAlgorithmPremptionEvaluationDuration.Observe(metrics.SinceInMicroseconds(preemptionStartTime))
metrics.SchedulingLatency.WithLabelValues(metrics.PreemptionEvaluation).Observe(metrics.SinceInSeconds(preemptionStartTime))
}
return
}
metrics.SchedulingAlgorithmLatency.Observe(metrics.SinceInMicroseconds(start))
// Tell the cache to assume that a pod now is running on a given node, even though it hasn't been bound yet.
// This allows us to keep scheduling without waiting on binding to occur.
assumedPod := pod.DeepCopy()
// Assume volumes first before assuming the pod.
//
// If all volumes are completely bound, then allBound is true and binding will be skipped.
//
// Otherwise, binding of volumes is started after the pod is assumed, but before pod binding.
//
// This function modifies 'assumedPod' if volume binding is required.
allBound, err := sched.assumeVolumes(assumedPod, suggestedHost)
if err != nil {
return
}
// assume modifies `assumedPod` by setting NodeName=suggestedHost
err = sched.assume(assumedPod, suggestedHost)
if err != nil {
return
}
// bind the pod to its host asynchronously (we can do this b/c of the assumption step above).
go func() {
// Bind volumes first before Pod
if !allBound {
err = sched.bindVolumes(assumedPod)
if err != nil {
return
}
}
err := sched.bind(assumedPod, &v1.Binding{
ObjectMeta: metav1.ObjectMeta{Namespace: assumedPod.Namespace, Name: assumedPod.Name, UID: assumedPod.UID},
Target: v1.ObjectReference{
Kind: "Node",
Name: suggestedHost,
},
})
metrics.E2eSchedulingLatency.Observe(metrics.SinceInMicroseconds(start))
if err != nil {
glog.Errorf("Internal error binding pod: (%v)", err)
}
}()
}
以下对重要代码分别进行分析。
3. config.NextPod
待调度的pod存储在podQueue队列中,NextPod从该队列中取出下一个需要被调度的pod。
pod := sched.config.NextPod()
if pod.DeletionTimestamp != nil {
sched.config.Recorder.Eventf(pod, v1.EventTypeWarning, "FailedScheduling", "skip schedule deleting pod: %v/%v", pod.Namespace, pod.Name)
glog.V(3).Infof("Skip schedule deleting pod: %v/%v", pod.Namespace, pod.Name)
return
}
glog.V(3).Infof("Attempting to schedule pod: %v/%v", pod.Namespace, pod.Name)
NextPod的具体函数在factory.go的CreateFromKeys函数中定义,如下:
func (c *configFactory) CreateFromKeys(predicateKeys, priorityKeys sets.String, extenders []algorithm.SchedulerExtender) (*scheduler.Config, error) {
...
return &scheduler.Config{
...
NextPod: func() *v1.Pod {
return c.getNextPod()
}
...
}
3.1. getNextPod
通过一个podQueue来存储需要调度的pod的队列,通过队列Pop的方式弹出需要被调度的pod。
func (c *configFactory) getNextPod() *v1.Pod {
pod, err := c.podQueue.Pop()
if err == nil {
glog.V(4).Infof("About to try and schedule pod %v/%v", pod.Namespace, pod.Name)
return pod
}
glog.Errorf("Error while retrieving next pod from scheduling queue: %v", err)
return nil
}
4. Scheduler.schedule
此部分代码位于pkg/scheduler/scheduler.go
此部分为调度逻辑的核心,通过不同的算法为具体的pod选择一个最合适的节点。
// Synchronously attempt to find a fit for the pod.
start := time.Now()
suggestedHost, err := sched.schedule(pod)
if err != nil {
// schedule() may have failed because the pod would not fit on any host, so we try to
// preempt, with the expectation that the next time the pod is tried for scheduling it
// will fit due to the preemption. It is also possible that a different pod will schedule
// into the resources that were preempted, but this is harmless.
if fitError, ok := err.(*core.FitError); ok {
preemptionStartTime := time.Now()
sched.preempt(pod, fitError)
metrics.PreemptionAttempts.Inc()
metrics.SchedulingAlgorithmPremptionEvaluationDuration.Observe(metrics.SinceInMicroseconds(preemptionStartTime))
metrics.SchedulingLatency.WithLabelValues(metrics.PreemptionEvaluation).Observe(metrics.SinceInSeconds(preemptionStartTime))
}
return
}
schedule通过调度算法返回一个最优的节点。
// schedule implements the scheduling algorithm and returns the suggested host.
func (sched *Scheduler) schedule(pod *v1.Pod) (string, error) {
host, err := sched.config.Algorithm.Schedule(pod, sched.config.NodeLister)
if err != nil {
pod = pod.DeepCopy()
sched.config.Error(pod, err)
sched.config.Recorder.Eventf(pod, v1.EventTypeWarning, "FailedScheduling", "%v", err)
sched.config.PodConditionUpdater.Update(pod, &v1.PodCondition{
Type: v1.PodScheduled,
Status: v1.ConditionFalse,
Reason: v1.PodReasonUnschedulable,
Message: err.Error(),
})
return "", err
}
return host, err
}
4.1. ScheduleAlgorithm
ScheduleAlgorithm是一个调度算法的接口,主要的实现体是genericScheduler,后续分析genericScheduler.Schedule。
ScheduleAlgorithm接口定义如下:
// ScheduleAlgorithm is an interface implemented by things that know how to schedule pods
// onto machines.
type ScheduleAlgorithm interface {
Schedule(*v1.Pod, NodeLister) (selectedMachine string, err error)
// Preempt receives scheduling errors for a pod and tries to create room for
// the pod by preempting lower priority pods if possible.
// It returns the node where preemption happened, a list of preempted pods, a
// list of pods whose nominated node name should be removed, and error if any.
Preempt(*v1.Pod, NodeLister, error) (selectedNode *v1.Node, preemptedPods []*v1.Pod, cleanupNominatedPods []*v1.Pod, err error)
// Predicates() returns a pointer to a map of predicate functions. This is
// exposed for testing.
Predicates() map[string]FitPredicate
// Prioritizers returns a slice of priority config. This is exposed for
// testing.
Prioritizers() []PriorityConfig
}
5. genericScheduler.Schedule
此部分代码位于/pkg/scheduler/core/generic_scheduler.go
genericScheduler.Schedule实现了基本的调度逻辑,基于给定需要调度的pod和node列表,如果执行成功返回调度的节点的名字,如果执行失败,则返回错误和原因。主要通过预选和优选两步操作完成调度的逻辑。
基本流程如下:
- 对pod做基本性检查,目前主要是对pvc的检查。
- 通过findNodesThatFit预选策略选出满足调度条件的node列表。
- 通过PrioritizeNodes优选策略给预选的node列表中的node进行打分。
- 在打分的node列表中选择一个分数最高的node作为调度的节点。
完整代码如下:
// Schedule tries to schedule the given pod to one of the nodes in the node list.
// If it succeeds, it will return the name of the node.
// If it fails, it will return a FitError error with reasons.
func (g *genericScheduler) Schedule(pod *v1.Pod, nodeLister algorithm.NodeLister) (string, error) {
trace := utiltrace.New(fmt.Sprintf("Scheduling %s/%s", pod.Namespace, pod.Name))
defer trace.LogIfLong(100 * time.Millisecond)
if err := podPassesBasicChecks(pod, g.pvcLister); err != nil {
return "", err
}
nodes, err := nodeLister.List()
if err != nil {
return "", err
}
if len(nodes) == 0 {
return "", ErrNoNodesAvailable
}
// Used for all fit and priority funcs.
err = g.cache.UpdateNodeNameToInfoMap(g.cachedNodeInfoMap)
if err != nil {
return "", err
}
trace.Step("Computing predicates")
startPredicateEvalTime := time.Now()
filteredNodes, failedPredicateMap, err := g.findNodesThatFit(pod, nodes)
if err != nil {
return "", err
}
if len(filteredNodes) == 0 {
return "", &FitError{
Pod: pod,
NumAllNodes: len(nodes),
FailedPredicates: failedPredicateMap,
}
}
metrics.SchedulingAlgorithmPredicateEvaluationDuration.Observe(metrics.SinceInMicroseconds(startPredicateEvalTime))
metrics.SchedulingLatency.WithLabelValues(metrics.PredicateEvaluation).Observe(metrics.SinceInSeconds(startPredicateEvalTime))
trace.Step("Prioritizing")
startPriorityEvalTime := time.Now()
// When only one node after predicate, just use it.
if len(filteredNodes) == 1 {
metrics.SchedulingAlgorithmPriorityEvaluationDuration.Observe(metrics.SinceInMicroseconds(startPriorityEvalTime))
return filteredNodes[0].Name, nil
}
metaPrioritiesInterface := g.priorityMetaProducer(pod, g.cachedNodeInfoMap)
priorityList, err := PrioritizeNodes(pod, g.cachedNodeInfoMap, metaPrioritiesInterface, g.prioritizers, filteredNodes, g.extenders)
if err != nil {
return "", err
}
metrics.SchedulingAlgorithmPriorityEvaluationDuration.Observe(metrics.SinceInMicroseconds(startPriorityEvalTime))
metrics.SchedulingLatency.WithLabelValues(metrics.PriorityEvaluation).Observe(metrics.SinceInSeconds(startPriorityEvalTime))
trace.Step("Selecting host")
return g.selectHost(priorityList)
}
5.1. podPassesBasicChecks
podPassesBasicChecks主要做一些基本性检查,目前主要是对pod使用的pvc进行检查。
if err := podPassesBasicChecks(pod, g.pvcLister); err != nil {
return "", err
}
podPassesBasicChecks具体实现如下:
// podPassesBasicChecks makes sanity checks on the pod if it can be scheduled.
func podPassesBasicChecks(pod *v1.Pod, pvcLister corelisters.PersistentVolumeClaimLister) error {
// Check PVCs used by the pod
namespace := pod.Namespace
manifest := &(pod.Spec)
for i := range manifest.Volumes {
volume := &manifest.Volumes[i]
if volume.PersistentVolumeClaim == nil {
// Volume is not a PVC, ignore
continue
}
pvcName := volume.PersistentVolumeClaim.ClaimName
pvc, err := pvcLister.PersistentVolumeClaims(namespace).Get(pvcName)
if err != nil {
// The error has already enough context ("persistentvolumeclaim "myclaim" not found")
return err
}
if pvc.DeletionTimestamp != nil {
return fmt.Errorf("persistentvolumeclaim %q is being deleted", pvc.Name)
}
}
return nil
}
5.2. findNodesThatFit
预选,通过预选函数来判断每个节点是否适合被该Pod调度。
具体的findNodesThatFit代码实现细节待后续文章独立分析。
genericScheduler.Schedule中对findNodesThatFit的调用过程如下:
trace.Step("Computing predicates")
startPredicateEvalTime := time.Now()
filteredNodes, failedPredicateMap, err := g.findNodesThatFit(pod, nodes)
if err != nil {
return "", err
}
if len(filteredNodes) == 0 {
return "", &FitError{
Pod: pod,
NumAllNodes: len(nodes),
FailedPredicates: failedPredicateMap,
}
}
metrics.SchedulingAlgorithmPredicateEvaluationDuration.Observe(metrics.SinceInMicroseconds(startPredicateEvalTime))
metrics.SchedulingLatency.WithLabelValues(metrics.PredicateEvaluation).Observe(metrics.SinceInSeconds(startPredicateEvalTime))
5.3. PrioritizeNodes
优选,从满足的节点中选择出最优的节点。
具体操作如下:
- PrioritizeNodes通过并行运行各个优先级函数来对节点进行优先级排序。
- 每个优先级函数会给节点打分,打分范围为0-10分。
- 0 表示优先级最低的节点,10表示优先级最高的节点。
- 每个优先级函数也有各自的权重。
- 优先级函数返回的节点分数乘以权重以获得加权分数。
- 最后组合(添加)所有分数以获得所有节点的总加权分数。
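也就是说,节点的最终得分是各优选函数得分与对应权重的加权和。下面用一个简化的Go示意说明这种计算方式,其中的结构体名称和数值均为示例假设,并非调度器源码:
// scoreByPriority 表示某个优选函数给节点的打分(0-10)及该优选函数的权重。
type scoreByPriority struct {
	score  int // 单个优选函数的打分,范围0-10
	weight int // 该优选函数的权重
}

// totalScore 计算一个节点的总加权分数:sum(score * weight)。
func totalScore(scores []scoreByPriority) int {
	total := 0
	for _, s := range scores {
		total += s.score * s.weight
	}
	return total
}
// 示例:LeastRequestedPriority得8分(权重1),NodeAffinityPriority得10分(权重1),
// 则该节点的总分为 8*1 + 10*1 = 18。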
具体PrioritizeNodes的实现逻辑待后续文章独立分析。
genericScheduler.Schedule中对PrioritizeNodes的调用过程如下:
trace.Step("Prioritizing")
startPriorityEvalTime := time.Now()
// When only one node after predicate, just use it.
if len(filteredNodes) == 1 {
metrics.SchedulingAlgorithmPriorityEvaluationDuration.Observe(metrics.SinceInMicroseconds(startPriorityEvalTime))
return filteredNodes[0].Name, nil
}
metaPrioritiesInterface := g.priorityMetaProducer(pod, g.cachedNodeInfoMap)
priorityList, err := PrioritizeNodes(pod, g.cachedNodeInfoMap, metaPrioritiesInterface, g.prioritizers, filteredNodes, g.extenders)
if err != nil {
return "", err
}
metrics.SchedulingAlgorithmPriorityEvaluationDuration.Observe(metrics.SinceInMicroseconds(startPriorityEvalTime))
metrics.SchedulingLatency.WithLabelValues(metrics.PriorityEvaluation).Observe(metrics.SinceInSeconds(startPriorityEvalTime))
5.4. selectHost
scheduler在最后会从priorityList中选择分数最高的一个节点。
trace.Step("Selecting host")
return g.selectHost(priorityList)
selectHost接收已经打好分的节点列表,然后在分数最高的节点中以轮询的方式选择一个节点。
具体代码如下:
// selectHost takes a prioritized list of nodes and then picks one
// in a round-robin manner from the nodes that had the highest score.
func (g *genericScheduler) selectHost(priorityList schedulerapi.HostPriorityList) (string, error) {
if len(priorityList) == 0 {
return "", fmt.Errorf("empty priorityList")
}
maxScores := findMaxScores(priorityList)
ix := int(g.lastNodeIndex % uint64(len(maxScores)))
g.lastNodeIndex++
return priorityList[maxScores[ix]].Host, nil
}
5.4.1. findMaxScores
findMaxScores返回priorityList中具有最高Score的节点的索引。
// findMaxScores returns the indexes of nodes in the "priorityList" that has the highest "Score".
func findMaxScores(priorityList schedulerapi.HostPriorityList) []int {
maxScoreIndexes := make([]int, 0, len(priorityList)/2)
maxScore := priorityList[0].Score
for i, hp := range priorityList {
if hp.Score > maxScore {
maxScore = hp.Score
maxScoreIndexes = maxScoreIndexes[:0]
maxScoreIndexes = append(maxScoreIndexes, i)
} else if hp.Score == maxScore {
maxScoreIndexes = append(maxScoreIndexes, i)
}
}
return maxScoreIndexes
}
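结合上面两个函数可以看出:当多个节点并列最高分时,selectHost会借助lastNodeIndex在这些节点之间轮询,避免总是选中同一个节点。下面用一个简化示意演示这种行为,其中hostPriority结构、selectHostDemo函数和示例数据均为假设,并非源码:
// 简化示意:在并列最高分的节点之间做轮询选择。
type hostPriority struct {
	Host  string
	Score int
}

var lastNodeIndex uint64

// selectHostDemo 从priorityList中并列最高分的节点索引maxScoreIndexes里轮询选出一个节点。
func selectHostDemo(priorityList []hostPriority, maxScoreIndexes []int) string {
	ix := int(lastNodeIndex % uint64(len(maxScoreIndexes)))
	lastNodeIndex++
	return priorityList[maxScoreIndexes[ix]].Host
}
// 例如priorityList为[{node-a 9} {node-b 5} {node-c 9}],findMaxScores返回[0, 2],
// 连续三次调用selectHostDemo将依次返回node-a、node-c、node-a,实现轮询。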
6. Scheduler.preempt
如果pod在预选和优选调度中失败,则执行抢占操作。抢占主要是将低优先级的pod的资源空间腾出给待调度的高优先级的pod。
具体Scheduler.preempt的实现逻辑待后续文章独立分析。
suggestedHost, err := sched.schedule(pod)
if err != nil {
// schedule() may have failed because the pod would not fit on any host, so we try to
// preempt, with the expectation that the next time the pod is tried for scheduling it
// will fit due to the preemption. It is also possible that a different pod will schedule
// into the resources that were preempted, but this is harmless.
if fitError, ok := err.(*core.FitError); ok {
preemptionStartTime := time.Now()
sched.preempt(pod, fitError)
metrics.PreemptionAttempts.Inc()
metrics.SchedulingAlgorithmPremptionEvaluationDuration.Observe(metrics.SinceInMicroseconds(preemptionStartTime))
metrics.SchedulingLatency.WithLabelValues(metrics.PreemptionEvaluation).Observe(metrics.SinceInSeconds(preemptionStartTime))
}
return
}
7. Scheduler.assume
将该pod和选定的节点进行假性绑定,存入scheduler cache中,方便可以继续执行调度逻辑,而不需要等待绑定操作的发生,具体绑定操作可以异步进行。
// Tell the cache to assume that a pod now is running on a given node, even though it hasn't been bound yet.
// This allows us to keep scheduling without waiting on binding to occur.
assumedPod := pod.DeepCopy()
// Assume volumes first before assuming the pod.
//
// If all volumes are completely bound, then allBound is true and binding will be skipped.
//
// Otherwise, binding of volumes is started after the pod is assumed, but before pod binding.
//
// This function modifies 'assumedPod' if volume binding is required.
allBound, err := sched.assumeVolumes(assumedPod, suggestedHost)
if err != nil {
return
}
// assume modifies `assumedPod` by setting NodeName=suggestedHost
err = sched.assume(assumedPod, suggestedHost)
if err != nil {
return
}
assume会乐观地假定绑定将会成功,并在后台将绑定请求发送给apiserver;如果绑定失败,scheduler会立即释放已分配给该假性绑定pod的资源。
assume方法的具体实现:
// assume signals to the cache that a pod is already in the cache, so that binding can be asynchronous.
// assume modifies `assumed`.
func (sched *Scheduler) assume(assumed *v1.Pod, host string) error {
// Optimistically assume that the binding will succeed and send it to apiserver
// in the background.
// If the binding fails, scheduler will release resources allocated to assumed pod
// immediately.
assumed.Spec.NodeName = host
// NOTE: Because the scheduler uses snapshots of SchedulerCache and the live
// version of Ecache, updates must be written to SchedulerCache before
// invalidating Ecache.
if err := sched.config.SchedulerCache.AssumePod(assumed); err != nil {
glog.Errorf("scheduler cache AssumePod failed: %v", err)
// This is most probably result of a BUG in retrying logic.
// We report an error here so that pod scheduling can be retried.
// This relies on the fact that Error will check if the pod has been bound
// to a node and if so will not add it back to the unscheduled pods queue
// (otherwise this would cause an infinite loop).
sched.config.Error(assumed, err)
sched.config.Recorder.Eventf(assumed, v1.EventTypeWarning, "FailedScheduling", "AssumePod failed: %v", err)
sched.config.PodConditionUpdater.Update(assumed, &v1.PodCondition{
Type: v1.PodScheduled,
Status: v1.ConditionFalse,
Reason: "SchedulerError",
Message: err.Error(),
})
return err
}
// Optimistically assume that the binding will succeed, so we need to invalidate affected
// predicates in equivalence cache.
// If the binding fails, these invalidated item will not break anything.
if sched.config.Ecache != nil {
sched.config.Ecache.InvalidateCachedPredicateItemForPodAdd(assumed, host)
}
return nil
}
8. Scheduler.bind
以异步的方式将pod绑定到具体的调度节点上。
// bind the pod to its host asynchronously (we can do this b/c of the assumption step above).
go func() {
// Bind volumes first before Pod
if !allBound {
err = sched.bindVolumes(assumedPod)
if err != nil {
return
}
}
err := sched.bind(assumedPod, &v1.Binding{
ObjectMeta: metav1.ObjectMeta{Namespace: assumedPod.Namespace, Name: assumedPod.Name, UID: assumedPod.UID},
Target: v1.ObjectReference{
Kind: "Node",
Name: suggestedHost,
},
})
metrics.E2eSchedulingLatency.Observe(metrics.SinceInMicroseconds(start))
if err != nil {
glog.Errorf("Internal error binding pod: (%v)", err)
}
}()
bind具体实现如下:
// bind binds a pod to a given node defined in a binding object. We expect this to run asynchronously, so we
// handle binding metrics internally.
func (sched *Scheduler) bind(assumed *v1.Pod, b *v1.Binding) error {
bindingStart := time.Now()
// If binding succeeded then PodScheduled condition will be updated in apiserver so that
// it's atomic with setting host.
err := sched.config.GetBinder(assumed).Bind(b)
if err := sched.config.SchedulerCache.FinishBinding(assumed); err != nil {
glog.Errorf("scheduler cache FinishBinding failed: %v", err)
}
if err != nil {
glog.V(1).Infof("Failed to bind pod: %v/%v", assumed.Namespace, assumed.Name)
if err := sched.config.SchedulerCache.ForgetPod(assumed); err != nil {
glog.Errorf("scheduler cache ForgetPod failed: %v", err)
}
sched.config.Error(assumed, err)
sched.config.Recorder.Eventf(assumed, v1.EventTypeWarning, "FailedScheduling", "Binding rejected: %v", err)
sched.config.PodConditionUpdater.Update(assumed, &v1.PodCondition{
Type: v1.PodScheduled,
Status: v1.ConditionFalse,
Reason: "BindingRejected",
})
return err
}
metrics.BindingLatency.Observe(metrics.SinceInMicroseconds(bindingStart))
metrics.SchedulingLatency.WithLabelValues(metrics.Binding).Observe(metrics.SinceInSeconds(bindingStart))
sched.config.Recorder.Eventf(assumed, v1.EventTypeNormal, "Scheduled", "Successfully assigned %v/%v to %v", assumed.Namespace, assumed.Name, b.Target.Name)
return nil
}
9. 总结
本文主要分析了单个pod的调度过程。具体流程如下:
- 从podQueue待调度队列中弹出需要调度的pod。
- 通过具体的调度算法为该pod选出合适的节点,其中调度算法就包括预选和优选两步策略。
- 如果上述调度失败,则会尝试抢占机制,将优先级低的pod剔除,让优先级高的pod调度成功。
- 将该pod和选定的节点进行假性绑定,存入scheduler cache中,方便具体绑定操作可以异步进行。
- 实际执行绑定操作,将node的名字添加到pod的节点相关属性中。
其中核心的部分为通过具体的调度算法选出调度节点的过程,即genericScheduler.Schedule的实现部分。该部分包括预选和优选两个部分。
genericScheduler.Schedule调度的基本流程如下:
- 对pod做基本性检查,目前主要是对pvc的检查。
- 通过findNodesThatFit预选策略选出满足调度条件的node列表。
- 通过PrioritizeNodes优选策略给预选的node列表中的node进行打分。
- 在打分的node列表中选择一个分数最高的node作为调度的节点。
参考:
11.5 -
11.5.1 - kubelet源码分析(一)之 NewKubeletCommand
以下代码分析基于kubernetes v1.12.0版本。本文主要分析 https://github.com/kubernetes/kubernetes/tree/v1.12.0/cmd/kubelet 部分的代码。
本文主要分析kubernetes/cmd/kubelet部分,该部分主要涉及kubelet的参数解析,以及相关依赖组件的初始化和构造(主要在kubeDeps结构体中),并不包含kubelet运行的详细逻辑;详细的运行逻辑位于kubernetes/pkg/kubelet模块,待后续文章分析。
kubelet的cmd代码目录结构如下:
kubelet
├── app
│ ├── auth.go
│ ├── init_others.go
│ ├── init_windows.go
│ ├── options # 包括kubelet使用到的option
│ │ ├── container_runtime.go
│ │ ├── globalflags.go
│ │ ├── globalflags_linux.go
│ │ ├── globalflags_other.go
│ │ ├── options.go # 包括KubeletFlags、AddFlags、AddKubeletConfigFlags等
│ │ ├── osflags_others.go
│ │ └── osflags_windows.go
│ ├── plugins.go
│ ├── server.go # 包括NewKubeletCommand、Run、RunKubelet、CreateAndInitKubelet、startKubelet等
│ ├── server_linux.go
│ └── server_unsupported.go
└── kubelet.go # kubelet的main入口函数
1. Main 函数
kubelet的入口为main函数,具体代码参考:https://github.com/kubernetes/kubernetes/blob/v1.12.0/cmd/kubelet/kubelet.go。
func main() {
rand.Seed(time.Now().UTC().UnixNano())
command := app.NewKubeletCommand(server.SetupSignalHandler())
logs.InitLogs()
defer logs.FlushLogs()
if err := command.Execute(); err != nil {
fmt.Fprintf(os.Stderr, "%v\n", err)
os.Exit(1)
}
}
kubelet代码主要采用了Cobra命令行框架,核心代码如下:
// 初始化命令行
command := app.NewKubeletCommand(server.SetupSignalHandler())
// 执行Execute
err := command.Execute()
2. NewKubeletCommand
NewKubeletCommand基于参数创建了一个*cobra.Command对象。其中核心部分代码为参数解析部分和Run函数。
// NewKubeletCommand creates a *cobra.Command object with default parameters
func NewKubeletCommand(stopCh <-chan struct{}) *cobra.Command {
...
cmd := &cobra.Command{
Use: componentKubelet,
Long: `...`,
// The Kubelet has special flag parsing requirements to enforce flag precedence rules,
// so we do all our parsing manually in Run, below.
// DisableFlagParsing=true provides the full set of flags passed to the kubelet in the
// `args` arg to Run, without Cobra's interference.
DisableFlagParsing: true,
Run: func(cmd *cobra.Command, args []string) {
...
// run the kubelet
glog.V(5).Infof("KubeletConfiguration: %#v", kubeletServer.KubeletConfiguration)
if err := Run(kubeletServer, kubeletDeps, stopCh); err != nil {
glog.Fatal(err)
}
},
}
...
return cmd
}
2.1. 参数解析
kubelet开启了DisableFlagParsing参数,没有使用Cobra框架中的默认参数解析,而是自定义参数解析。
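这种"关闭Cobra自带解析、在Run中手动解析"的模式大致如下。这是一个极简示意,使用标准的cobra与pflag接口(需import github.com/spf13/cobra和github.com/spf13/pflag),命令名my-kubelet和--config参数仅为示例,并非kubelet的真实参数集:
// 极简示意:DisableFlagParsing为true时,Cobra不解析flag,
// 所有命令行参数会原样通过args传入Run,由我们自己用pflag解析。
fs := pflag.NewFlagSet("my-kubelet", pflag.ContinueOnError)
configFile := fs.String("config", "", "kubelet config file路径(示例参数)")

cmd := &cobra.Command{
	Use:                "my-kubelet",
	DisableFlagParsing: true,
	Run: func(cmd *cobra.Command, args []string) {
		// 手动解析全部参数,这样可以自行控制flag与配置文件之间的优先级。
		if err := fs.Parse(args); err != nil {
			cmd.Usage()
			return
		}
		fmt.Printf("config file: %s\n", *configFile)
	},
}
_ = cmd.Execute()
这也解释了为什么kubelet需要自己维护cleanFlagSet:只有拿到完整的原始参数,才能在加载配置文件后重新解析命令行,保证flag的优先级高于配置文件。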
2.1.1. 初始化参数和配置
初始化参数解析相关的对象:cleanFlagSet、kubeletFlags和kubeletConfig。
cleanFlagSet := pflag.NewFlagSet(componentKubelet, pflag.ContinueOnError)
cleanFlagSet.SetNormalizeFunc(flag.WordSepNormalizeFunc)
kubeletFlags := options.NewKubeletFlags()
kubeletConfig, err := options.NewKubeletConfiguration()
2.1.2. 打印帮助信息和版本信息
如果输入非法参数则打印使用帮助信息。
// initial flag parse, since we disable cobra's flag parsing
if err := cleanFlagSet.Parse(args); err != nil {
cmd.Usage()
glog.Fatal(err)
}
// check if there are non-flag arguments in the command line
cmds := cleanFlagSet.Args()
if len(cmds) > 0 {
cmd.Usage()
glog.Fatalf("unknown command: %s", cmds[0])
}
遇到help和version参数则打印相关内容并退出。
// short-circuit on help
help, err := cleanFlagSet.GetBool("help")
if err != nil {
glog.Fatal(`"help" flag is non-bool, programmer error, please correct`)
}
if help {
cmd.Help()
return
}
// short-circuit on verflag
verflag.PrintAndExitIfRequested()
utilflag.PrintFlags(cleanFlagSet)
2.1.3. kubelet config
加载并校验kubelet config。其中包括校验初始化的kubeletFlags,并从kubeletFlags的KubeletConfigFile参数获取kubelet config的内容。
// set feature gates from initial flags-based config
if err := utilfeature.DefaultFeatureGate.SetFromMap(kubeletConfig.FeatureGates); err != nil {
glog.Fatal(err)
}
// validate the initial KubeletFlags
if err := options.ValidateKubeletFlags(kubeletFlags); err != nil {
glog.Fatal(err)
}
if kubeletFlags.ContainerRuntime == "remote" && cleanFlagSet.Changed("pod-infra-container-image") {
glog.Warning("Warning: For remote container runtime, --pod-infra-container-image is ignored in kubelet, which should be set in that remote runtime instead")
}
// load kubelet config file, if provided
if configFile := kubeletFlags.KubeletConfigFile; len(configFile) > 0 {
kubeletConfig, err = loadConfigFile(configFile)
if err != nil {
glog.Fatal(err)
}
// We must enforce flag precedence by re-parsing the command line into the new object.
// This is necessary to preserve backwards-compatibility across binary upgrades.
// See issue #56171 for more details.
if err := kubeletConfigFlagPrecedence(kubeletConfig, args); err != nil {
glog.Fatal(err)
}
// update feature gates based on new config
if err := utilfeature.DefaultFeatureGate.SetFromMap(kubeletConfig.FeatureGates); err != nil {
glog.Fatal(err)
}
}
// We always validate the local configuration (command line + config file).
// This is the default "last-known-good" config for dynamic config, and must always remain valid.
if err := kubeletconfigvalidation.ValidateKubeletConfiguration(kubeletConfig); err != nil {
glog.Fatal(err)
}
2.1.4. dynamic kubelet config
如果开启了动态kubelet配置(dynamic kubelet config),则使用动态配置替换本地的kubelet配置。
// use dynamic kubelet config, if enabled
var kubeletConfigController *dynamickubeletconfig.Controller
if dynamicConfigDir := kubeletFlags.DynamicConfigDir.Value(); len(dynamicConfigDir) > 0 {
var dynamicKubeletConfig *kubeletconfiginternal.KubeletConfiguration
dynamicKubeletConfig, kubeletConfigController, err = BootstrapKubeletConfigController(dynamicConfigDir,
func(kc *kubeletconfiginternal.KubeletConfiguration) error {
// Here, we enforce flag precedence inside the controller, prior to the controller's validation sequence,
// so that we get a complete validation at the same point where we can decide to reject dynamic config.
// This fixes the flag-precedence component of issue #63305.
// See issue #56171 for general details on flag precedence.
return kubeletConfigFlagPrecedence(kc, args)
})
if err != nil {
glog.Fatal(err)
}
// If we should just use our existing, local config, the controller will return a nil config
if dynamicKubeletConfig != nil {
kubeletConfig = dynamicKubeletConfig
// Note: flag precedence was already enforced in the controller, prior to validation,
// by our above transform function. Now we simply update feature gates from the new config.
if err := utilfeature.DefaultFeatureGate.SetFromMap(kubeletConfig.FeatureGates); err != nil {
glog.Fatal(err)
}
}
}
总结:以上通过对各类参数的解析,最终生成kubeletFlags和kubeletConfig两个重要的参数对象,用来构造kubeletServer和其他依赖组件。
2.2. 初始化kubeletServer和kubeletDeps
2.2.1. kubeletServer
// construct a KubeletServer from kubeletFlags and kubeletConfig
kubeletServer := &options.KubeletServer{
KubeletFlags: *kubeletFlags,
KubeletConfiguration: *kubeletConfig,
}
2.2.2. kubeletDeps
// use kubeletServer to construct the default KubeletDeps
kubeletDeps, err := UnsecuredDependencies(kubeletServer)
if err != nil {
glog.Fatal(err)
}
// add the kubelet config controller to kubeletDeps
kubeletDeps.KubeletConfigController = kubeletConfigController
2.2.3. docker shim
如果开启了docker shim参数,则执行RunDockershim。
// start the experimental docker shim, if enabled
if kubeletServer.KubeletFlags.ExperimentalDockershim {
if err := RunDockershim(&kubeletServer.KubeletFlags, kubeletConfig, stopCh); err != nil {
glog.Fatal(err)
}
return
}
2.3. AddFlags
// keep cleanFlagSet separate, so Cobra doesn't pollute it with the global flags
kubeletFlags.AddFlags(cleanFlagSet)
options.AddKubeletConfigFlags(cleanFlagSet, kubeletConfig)
options.AddGlobalFlags(cleanFlagSet)
cleanFlagSet.BoolP("help", "h", false, fmt.Sprintf("help for %s", cmd.Name()))
// ugly, but necessary, because Cobra's default UsageFunc and HelpFunc pollute the flagset with global flags
const usageFmt = "Usage:\n %s\n\nFlags:\n%s"
cmd.SetUsageFunc(func(cmd *cobra.Command) error {
fmt.Fprintf(cmd.OutOrStderr(), usageFmt, cmd.UseLine(), cleanFlagSet.FlagUsagesWrapped(2))
return nil
})
cmd.SetHelpFunc(func(cmd *cobra.Command, args []string) {
fmt.Fprintf(cmd.OutOrStdout(), "%s\n\n"+usageFmt, cmd.Long, cmd.UseLine(), cleanFlagSet.FlagUsagesWrapped(2))
})
其中:
- AddFlags代码可参考:kubernetes/cmd/kubelet/app/options/options.go#L323
- AddKubeletConfigFlags代码可参考:kubernetes/cmd/kubelet/app/options/options.go#L424
2.4. 运行kubelet
运行kubelet并且不退出。由Run函数进入后续的操作。
// run the kubelet
glog.V(5).Infof("KubeletConfiguration: %#v", kubeletServer.KubeletConfiguration)
if err := Run(kubeletServer, kubeletDeps, stopCh); err != nil {
glog.Fatal(err)
}
3. Run
// Run runs the specified KubeletServer with the given Dependencies. This should never exit.
// The kubeDeps argument may be nil - if so, it is initialized from the settings on KubeletServer.
// Otherwise, the caller is assumed to have set up the Dependencies object and a default one will
// not be generated.
func Run(s *options.KubeletServer, kubeDeps *kubelet.Dependencies, stopCh <-chan struct{}) error {
// To help debugging, immediately log version
glog.Infof("Version: %+v", version.Get())
if err := initForOS(s.KubeletFlags.WindowsService); err != nil {
return fmt.Errorf("failed OS init: %v", err)
}
if err := run(s, kubeDeps, stopCh); err != nil {
return fmt.Errorf("failed to run Kubelet: %v", err)
}
return nil
}
initForOS在Windows环境下会执行Windows服务相关的初始化操作,在其他平台上为空实现,仅作预留。随后执行run(s, kubeDeps, stopCh)函数进入具体逻辑。
3.1. 构造kubeDeps
3.1.1. clientConfig
创建clientConfig,该对象用来创建kubeDeps中包含的各种client。
clientConfig, err := createAPIServerClientConfig(s)
if err != nil {
return fmt.Errorf("invalid kubeconfig: %v", err)
}
3.1.2. kubeClient
kubeClient, err = clientset.NewForConfig(clientConfig)
if err != nil {
glog.Warningf("New kubeClient from clientConfig error: %v", err)
} else if kubeClient.CertificatesV1beta1() != nil && clientCertificateManager != nil {
glog.V(2).Info("Starting client certificate rotation.")
clientCertificateManager.SetCertificateSigningRequestClient(kubeClient.CertificatesV1beta1().CertificateSigningRequests())
clientCertificateManager.Start()
}
3.1.3. dynamicKubeClient
dynamicKubeClient, err = dynamic.NewForConfig(clientConfig)
if err != nil {
glog.Warningf("Failed to initialize dynamic KubeClient: %v", err)
}
3.1.4. eventClient
// make a separate client for events
eventClientConfig := *clientConfig
eventClientConfig.QPS = float32(s.EventRecordQPS)
eventClientConfig.Burst = int(s.EventBurst)
eventClient, err = v1core.NewForConfig(&eventClientConfig)
if err != nil {
glog.Warningf("Failed to create API Server client for Events: %v", err)
}
3.1.5. heartbeatClient
// make a separate client for heartbeat with throttling disabled and a timeout attached
heartbeatClientConfig := *clientConfig
heartbeatClientConfig.Timeout = s.KubeletConfiguration.NodeStatusUpdateFrequency.Duration
// if the NodeLease feature is enabled, the timeout is the minimum of the lease duration and status update frequency
if utilfeature.DefaultFeatureGate.Enabled(features.NodeLease) {
leaseTimeout := time.Duration(s.KubeletConfiguration.NodeLeaseDurationSeconds) * time.Second
if heartbeatClientConfig.Timeout > leaseTimeout {
heartbeatClientConfig.Timeout = leaseTimeout
}
}
heartbeatClientConfig.QPS = float32(-1)
heartbeatClient, err = clientset.NewForConfig(&heartbeatClientConfig)
if err != nil {
glog.Warningf("Failed to create API Server client for heartbeat: %v", err)
}
3.1.6. csiClient
// csiClient works with CRDs that support json only
clientConfig.ContentType = "application/json"
csiClient, err := csiclientset.NewForConfig(clientConfig)
if err != nil {
glog.Warningf("Failed to create CSI API client: %v", err)
}
client赋值
kubeDeps.KubeClient = kubeClient
kubeDeps.DynamicKubeClient = dynamicKubeClient
if heartbeatClient != nil {
kubeDeps.HeartbeatClient = heartbeatClient
kubeDeps.OnHeartbeatFailure = closeAllConns
}
if eventClient != nil {
kubeDeps.EventClient = eventClient
}
kubeDeps.CSIClient = csiClient
3.1.7. CAdvisorInterface
if kubeDeps.CAdvisorInterface == nil {
imageFsInfoProvider := cadvisor.NewImageFsInfoProvider(s.ContainerRuntime, s.RemoteRuntimeEndpoint)
kubeDeps.CAdvisorInterface, err = cadvisor.New(imageFsInfoProvider, s.RootDirectory, cadvisor.UsingLegacyCadvisorStats(s.ContainerRuntime, s.RemoteRuntimeEndpoint))
if err != nil {
return err
}
}
3.1.8. ContainerManager
if kubeDeps.ContainerManager == nil {
if s.CgroupsPerQOS && s.CgroupRoot == "" {
glog.Infof("--cgroups-per-qos enabled, but --cgroup-root was not specified. defaulting to /")
s.CgroupRoot = "/"
}
kubeReserved, err := parseResourceList(s.KubeReserved)
if err != nil {
return err
}
systemReserved, err := parseResourceList(s.SystemReserved)
if err != nil {
return err
}
var hardEvictionThresholds []evictionapi.Threshold
// If the user requested to ignore eviction thresholds, then do not set valid values for hardEvictionThresholds here.
if !s.ExperimentalNodeAllocatableIgnoreEvictionThreshold {
hardEvictionThresholds, err = eviction.ParseThresholdConfig([]string{}, s.EvictionHard, nil, nil, nil)
if err != nil {
return err
}
}
experimentalQOSReserved, err := cm.ParseQOSReserved(s.QOSReserved)
if err != nil {
return err
}
devicePluginEnabled := utilfeature.DefaultFeatureGate.Enabled(features.DevicePlugins)
kubeDeps.ContainerManager, err = cm.NewContainerManager(
kubeDeps.Mounter,
kubeDeps.CAdvisorInterface,
cm.NodeConfig{
RuntimeCgroupsName: s.RuntimeCgroups,
SystemCgroupsName: s.SystemCgroups,
KubeletCgroupsName: s.KubeletCgroups,
ContainerRuntime: s.ContainerRuntime,
CgroupsPerQOS: s.CgroupsPerQOS,
CgroupRoot: s.CgroupRoot,
CgroupDriver: s.CgroupDriver,
KubeletRootDir: s.RootDirectory,
ProtectKernelDefaults: s.ProtectKernelDefaults,
NodeAllocatableConfig: cm.NodeAllocatableConfig{
KubeReservedCgroupName: s.KubeReservedCgroup,
SystemReservedCgroupName: s.SystemReservedCgroup,
EnforceNodeAllocatable: sets.NewString(s.EnforceNodeAllocatable...),
KubeReserved: kubeReserved,
SystemReserved: systemReserved,
HardEvictionThresholds: hardEvictionThresholds,
},
QOSReserved: *experimentalQOSReserved,
ExperimentalCPUManagerPolicy: s.CPUManagerPolicy,
ExperimentalCPUManagerReconcilePeriod: s.CPUManagerReconcilePeriod.Duration,
ExperimentalPodPidsLimit: s.PodPidsLimit,
EnforceCPULimits: s.CPUCFSQuota,
CPUCFSQuotaPeriod: s.CPUCFSQuotaPeriod.Duration,
},
s.FailSwapOn,
devicePluginEnabled,
kubeDeps.Recorder)
if err != nil {
return err
}
}
3.1.9. oomAdjuster
// TODO(vmarmol): Do this through container config.
oomAdjuster := kubeDeps.OOMAdjuster
if err := oomAdjuster.ApplyOOMScoreAdj(0, int(s.OOMScoreAdj)); err != nil {
glog.Warning(err)
}
3.2. Health check
if s.HealthzPort > 0 {
healthz.DefaultHealthz()
go wait.Until(func() {
err := http.ListenAndServe(net.JoinHostPort(s.HealthzBindAddress, strconv.Itoa(int(s.HealthzPort))), nil)
if err != nil {
glog.Errorf("Starting health server failed: %v", err)
}
}, 5*time.Second, wait.NeverStop)
}
3.3. RunKubelet
通过各种赋值构造了完整的kubeDeps结构体,最后再执行RunKubelet转入后续的kubelet执行流程。
if err := RunKubelet(s, kubeDeps, s.RunOnce); err != nil {
return err
}
4. RunKubelet
// RunKubelet is responsible for setting up and running a kubelet. It is used in three different applications:
// 1 Integration tests
// 2 Kubelet binary
// 3 Standalone 'kubernetes' binary
// Eventually, #2 will be replaced with instances of #3
func RunKubelet(kubeServer *options.KubeletServer, kubeDeps *kubelet.Dependencies, runOnce bool) error {
...
k, err := CreateAndInitKubelet(&kubeServer.KubeletConfiguration,
...
kubeServer.NodeStatusMaxImages)
if err != nil {
return fmt.Errorf("failed to create kubelet: %v", err)
}
// NewMainKubelet should have set up a pod source config if one didn't exist
// when the builder was run. This is just a precaution.
if kubeDeps.PodConfig == nil {
return fmt.Errorf("failed to create kubelet, pod source config was nil")
}
podCfg := kubeDeps.PodConfig
rlimit.RlimitNumFiles(uint64(kubeServer.MaxOpenFiles))
// process pods and exit.
if runOnce {
if _, err := k.RunOnce(podCfg.Updates()); err != nil {
return fmt.Errorf("runonce failed: %v", err)
}
glog.Infof("Started kubelet as runonce")
} else {
startKubelet(k, podCfg, &kubeServer.KubeletConfiguration, kubeDeps, kubeServer.EnableServer)
glog.Infof("Started kubelet")
}
return nil
}
RunKubelet函数的核心是执行CreateAndInitKubelet和startKubelet两个函数,以下分别对这两个函数进行分析。
4.1. CreateAndInitKubelet
通过传入kubeDeps调用CreateAndInitKubelet初始化Kubelet。
k, err := CreateAndInitKubelet(&kubeServer.KubeletConfiguration,
kubeDeps,
&kubeServer.ContainerRuntimeOptions,
kubeServer.ContainerRuntime,
kubeServer.RuntimeCgroups,
kubeServer.HostnameOverride,
kubeServer.NodeIP,
kubeServer.ProviderID,
kubeServer.CloudProvider,
kubeServer.CertDirectory,
kubeServer.RootDirectory,
kubeServer.RegisterNode,
kubeServer.RegisterWithTaints,
kubeServer.AllowedUnsafeSysctls,
kubeServer.RemoteRuntimeEndpoint,
kubeServer.RemoteImageEndpoint,
kubeServer.ExperimentalMounterPath,
kubeServer.ExperimentalKernelMemcgNotification,
kubeServer.ExperimentalCheckNodeCapabilitiesBeforeMount,
kubeServer.ExperimentalNodeAllocatableIgnoreEvictionThreshold,
kubeServer.MinimumGCAge,
kubeServer.MaxPerPodContainerCount,
kubeServer.MaxContainerCount,
kubeServer.MasterServiceNamespace,
kubeServer.RegisterSchedulable,
kubeServer.NonMasqueradeCIDR,
kubeServer.KeepTerminatedPodVolumes,
kubeServer.NodeLabels,
kubeServer.SeccompProfileRoot,
kubeServer.BootstrapCheckpointPath,
kubeServer.NodeStatusMaxImages)
if err != nil {
return fmt.Errorf("failed to create kubelet: %v", err)
}
4.1.1. NewMainKubelet
CreateAndInitKubelet方法中执行的核心函数是NewMainKubelet,NewMainKubelet实例化一个kubelet对象,该部分的具体代码在kubernetes/pkg/kubelet中,具体参考:kubernetes/pkg/kubelet/kubelet.go#L325。
func CreateAndInitKubelet(kubeCfg *kubeletconfiginternal.KubeletConfiguration,
...
nodeStatusMaxImages int32) (k kubelet.Bootstrap, err error) {
// TODO: block until all sources have delivered at least one update to the channel, or break the sync loop
// up into "per source" synchronizations
k, err = kubelet.NewMainKubelet(kubeCfg,
kubeDeps,
crOptions,
containerRuntime,
runtimeCgroups,
hostnameOverride,
nodeIP,
providerID,
cloudProvider,
certDirectory,
rootDirectory,
registerNode,
registerWithTaints,
allowedUnsafeSysctls,
remoteRuntimeEndpoint,
remoteImageEndpoint,
experimentalMounterPath,
experimentalKernelMemcgNotification,
experimentalCheckNodeCapabilitiesBeforeMount,
experimentalNodeAllocatableIgnoreEvictionThreshold,
minimumGCAge,
maxPerPodContainerCount,
maxContainerCount,
masterServiceNamespace,
registerSchedulable,
nonMasqueradeCIDR,
keepTerminatedPodVolumes,
nodeLabels,
seccompProfileRoot,
bootstrapCheckpointPath,
nodeStatusMaxImages)
if err != nil {
return nil, err
}
k.BirthCry()
k.StartGarbageCollection()
return k, nil
}
4.1.2. PodConfig
if kubeDeps.PodConfig == nil {
var err error
kubeDeps.PodConfig, err = makePodSourceConfig(kubeCfg, kubeDeps, nodeName, bootstrapCheckpointPath)
if err != nil {
return nil, err
}
}
NewMainKubelet-->PodConfig-->NewPodConfig-->kubetypes.PodUpdate。会生成一个podUpdate的channel来监听pod的变化,该channel会在k.Run(podCfg.Updates())中作为关键入参。
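换句话说,各个pod来源(file、http、apiserver)产生的更新最终都会汇聚到这个updates channel中,kubelet的主循环则不断从channel中读取更新并处理。消费方式大致如下,这是一个简化示意,podUpdate和syncLoopDemo均为示例定义,并非源码(真实的kubetypes.PodUpdate包含Pods、Op、Source等字段):
// 简化示意:演示从一个updates channel中持续消费pod更新事件的模式。
type podUpdate struct {
	Op     string // 例如 ADD / UPDATE / REMOVE
	Source string // 例如 file / http / api
}

func syncLoopDemo(updates <-chan podUpdate) {
	for u := range updates {
		// 根据更新类型执行对应的同步逻辑,这里仅打印。
		fmt.Printf("收到来自%s的%s更新\n", u.Source, u.Op)
	}
}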
4.2. startKubelet
// process pods and exit.
if runOnce {
if _, err := k.RunOnce(podCfg.Updates()); err != nil {
return fmt.Errorf("runonce failed: %v", err)
}
glog.Infof("Started kubelet as runonce")
} else {
startKubelet(k, podCfg, &kubeServer.KubeletConfiguration, kubeDeps, kubeServer.EnableServer)
glog.Infof("Started kubelet")
}
如果设置了只运行一次的参数,则执行k.RunOnce,否则执行核心函数startKubelet。具体实现如下:
func startKubelet(k kubelet.Bootstrap, podCfg *config.PodConfig, kubeCfg *kubeletconfiginternal.KubeletConfiguration, kubeDeps *kubelet.Dependencies, enableServer bool) {
// start the kubelet
go wait.Until(func() {
k.Run(podCfg.Updates())
}, 0, wait.NeverStop)
// start the kubelet server
if enableServer {
go k.ListenAndServe(net.ParseIP(kubeCfg.Address), uint(kubeCfg.Port), kubeDeps.TLSOptions, kubeDeps.Auth, kubeCfg.EnableDebuggingHandlers, kubeCfg.EnableContentionProfiling)
}
if kubeCfg.ReadOnlyPort > 0 {
go k.ListenAndServeReadOnly(net.ParseIP(kubeCfg.Address), uint(kubeCfg.ReadOnlyPort))
}
}
4.2.1. k.Run
// start the kubelet
go wait.Until(func() {
k.Run(podCfg.Updates())
}, 0, wait.NeverStop)
通过常驻进程的方式运行k.Run,不退出,将kubelet的运行逻辑引入kubernetes/pkg/kubelet/kubelet.go部分,kubernetes/pkg/kubelet部分的运行逻辑待后续文章分析。
5. 总结
- kubelet采用Cobra命令行框架和pflag参数解析框架,和apiserver、scheduler、controller-manager形成统一的代码风格。
- kubernetes/cmd/kubelet部分主要对运行参数进行定义和解析,初始化和构造相关的依赖组件(主要在kubeDeps结构体中),并没有kubelet运行的详细逻辑,该部分位于kubernetes/pkg/kubelet模块。
- cmd部分调用流程如下:Main-->NewKubeletCommand-->Run(kubeletServer, kubeletDeps, stopCh)-->run(s *options.KubeletServer, kubeDeps ..., stopCh ...)-->RunKubelet(s, kubeDeps, s.RunOnce)-->startKubelet-->k.Run(podCfg.Updates())-->pkg/kubelet。同时RunKubelet(s, kubeDeps, s.RunOnce)-->CreateAndInitKubelet-->kubelet.NewMainKubelet-->pkg/kubelet。
参考文章:
11.5.2 - kubelet源码分析(二)之 NewMainKubelet
以下代码分析基于kubernetes v1.12.0版本。本文主要分析 https://github.com/kubernetes/kubernetes/tree/v1.12.0/pkg/kubelet 部分的代码。
本文主要分析kubelet中的NewMainKubelet部分。
1. NewMainKubelet
NewMainKubelet主要用来初始化和构造一个kubelet结构体,kubelet结构体定义参考:https://github.com/kubernetes/kubernetes/blob/v1.12.0/pkg/kubelet/kubelet.go#L888
// NewMainKubelet instantiates a new Kubelet object along with all the required internal modules.
// No initialization of Kubelet and its modules should happen here.
func NewMainKubelet(kubeCfg *kubeletconfiginternal.KubeletConfiguration,
kubeDeps *Dependencies,
crOptions *config.ContainerRuntimeOptions,
containerRuntime string,
runtimeCgroups string,
hostnameOverride string,
nodeIP string,
providerID string,
cloudProvider string,
certDirectory string,
rootDirectory string,
registerNode bool,
registerWithTaints []api.Taint,
allowedUnsafeSysctls []string,
remoteRuntimeEndpoint string,
remoteImageEndpoint string,
experimentalMounterPath string,
experimentalKernelMemcgNotification bool,
experimentalCheckNodeCapabilitiesBeforeMount bool,
experimentalNodeAllocatableIgnoreEvictionThreshold bool,
minimumGCAge metav1.Duration,
maxPerPodContainerCount int32,
maxContainerCount int32,
masterServiceNamespace string,
registerSchedulable bool,
nonMasqueradeCIDR string,
keepTerminatedPodVolumes bool,
nodeLabels map[string]string,
seccompProfileRoot string,
bootstrapCheckpointPath string,
nodeStatusMaxImages int32) (*Kubelet, error) {
...
}
1.1. PodConfig
通过makePodSourceConfig生成Pod config。
if kubeDeps.PodConfig == nil {
var err error
kubeDeps.PodConfig, err = makePodSourceConfig(kubeCfg, kubeDeps, nodeName, bootstrapCheckpointPath)
if err != nil {
return nil, err
}
}
1.1.1. makePodSourceConfig
// makePodSourceConfig creates a config.PodConfig from the given
// KubeletConfiguration or returns an error.
func makePodSourceConfig(kubeCfg *kubeletconfiginternal.KubeletConfiguration, kubeDeps *Dependencies, nodeName types.NodeName, bootstrapCheckpointPath string) (*config.PodConfig, error) {
...
// source of all configuration
cfg := config.NewPodConfig(config.PodConfigNotificationIncremental, kubeDeps.Recorder)
// define file config source
if kubeCfg.StaticPodPath != "" {
glog.Infof("Adding pod path: %v", kubeCfg.StaticPodPath)
config.NewSourceFile(kubeCfg.StaticPodPath, nodeName, kubeCfg.FileCheckFrequency.Duration, cfg.Channel(kubetypes.FileSource))
}
// define url config source
if kubeCfg.StaticPodURL != "" {
glog.Infof("Adding pod url %q with HTTP header %v", kubeCfg.StaticPodURL, manifestURLHeader)
config.NewSourceURL(kubeCfg.StaticPodURL, manifestURLHeader, nodeName, kubeCfg.HTTPCheckFrequency.Duration, cfg.Channel(kubetypes.HTTPSource))
}
// Restore from the checkpoint path
// NOTE: This MUST happen before creating the apiserver source
// below, or the checkpoint would override the source of truth.
...
if kubeDeps.KubeClient != nil {
glog.Infof("Watching apiserver")
if updatechannel == nil {
updatechannel = cfg.Channel(kubetypes.ApiserverSource)
}
config.NewSourceApiserver(kubeDeps.KubeClient, nodeName, updatechannel)
}
return cfg, nil
}
1.1.2. NewPodConfig
// NewPodConfig creates an object that can merge many configuration sources into a stream
// of normalized updates to a pod configuration.
func NewPodConfig(mode PodConfigNotificationMode, recorder record.EventRecorder) *PodConfig {
updates := make(chan kubetypes.PodUpdate, 50)
storage := newPodStorage(updates, mode, recorder)
podConfig := &PodConfig{
pods: storage,
mux: config.NewMux(storage),
updates: updates,
sources: sets.String{},
}
return podConfig
}
1.1.3. NewSourceApiserver
// NewSourceApiserver creates a config source that watches and pulls from the apiserver.
func NewSourceApiserver(c clientset.Interface, nodeName types.NodeName, updates chan<- interface{}) {
lw := cache.NewListWatchFromClient(c.CoreV1().RESTClient(), "pods", metav1.NamespaceAll, fields.OneTermEqualSelector(api.PodHostField, string(nodeName)))
newSourceApiserverFromLW(lw, updates)
}
1.2. Lister
serviceLister和nodeLister分别通过List-Watch机制监听service和node的列表变化。
1.2.1. serviceLister
serviceIndexer := cache.NewIndexer(cache.MetaNamespaceKeyFunc, cache.Indexers{cache.NamespaceIndex: cache.MetaNamespaceIndexFunc})
if kubeDeps.KubeClient != nil {
serviceLW := cache.NewListWatchFromClient(kubeDeps.KubeClient.CoreV1().RESTClient(), "services", metav1.NamespaceAll, fields.Everything())
r := cache.NewReflector(serviceLW, &v1.Service{}, serviceIndexer, 0)
go r.Run(wait.NeverStop)
}
serviceLister := corelisters.NewServiceLister(serviceIndexer)
1.2.2. nodeLister
nodeIndexer := cache.NewIndexer(cache.MetaNamespaceKeyFunc, cache.Indexers{})
if kubeDeps.KubeClient != nil {
fieldSelector := fields.Set{api.ObjectNameField: string(nodeName)}.AsSelector()
nodeLW := cache.NewListWatchFromClient(kubeDeps.KubeClient.CoreV1().RESTClient(), "nodes", metav1.NamespaceAll, fieldSelector)
r := cache.NewReflector(nodeLW, &v1.Node{}, nodeIndexer, 0)
go r.Run(wait.NeverStop)
}
nodeInfo := &predicates.CachedNodeInfo{NodeLister: corelisters.NewNodeLister(nodeIndexer)}
1.3. 各种Manager
1.3.1. containerRefManager
containerRefManager := kubecontainer.NewRefManager()
1.3.2. oomWatcher
oomWatcher := NewOOMWatcher(kubeDeps.CAdvisorInterface, kubeDeps.Recorder)
1.3.3. dnsConfigurer
clusterDNS := make([]net.IP, 0, len(kubeCfg.ClusterDNS))
for _, ipEntry := range kubeCfg.ClusterDNS {
ip := net.ParseIP(ipEntry)
if ip == nil {
glog.Warningf("Invalid clusterDNS ip '%q'", ipEntry)
} else {
clusterDNS = append(clusterDNS, ip)
}
}
...
dns.NewConfigurer(kubeDeps.Recorder, nodeRef, parsedNodeIP, clusterDNS, kubeCfg.ClusterDomain, kubeCfg.ResolverConfig),
1.3.4. secretManager & configMapManager
var secretManager secret.Manager
var configMapManager configmap.Manager
switch kubeCfg.ConfigMapAndSecretChangeDetectionStrategy {
case kubeletconfiginternal.WatchChangeDetectionStrategy:
secretManager = secret.NewWatchingSecretManager(kubeDeps.KubeClient)
configMapManager = configmap.NewWatchingConfigMapManager(kubeDeps.KubeClient)
case kubeletconfiginternal.TTLCacheChangeDetectionStrategy:
secretManager = secret.NewCachingSecretManager(
kubeDeps.KubeClient, manager.GetObjectTTLFromNodeFunc(klet.GetNode))
configMapManager = configmap.NewCachingConfigMapManager(
kubeDeps.KubeClient, manager.GetObjectTTLFromNodeFunc(klet.GetNode))
case kubeletconfiginternal.GetChangeDetectionStrategy:
secretManager = secret.NewSimpleSecretManager(kubeDeps.KubeClient)
configMapManager = configmap.NewSimpleConfigMapManager(kubeDeps.KubeClient)
default:
return nil, fmt.Errorf("unknown configmap and secret manager mode: %v", kubeCfg.ConfigMapAndSecretChangeDetectionStrategy)
}
klet.secretManager = secretManager
klet.configMapManager = configMapManager
1.3.5. livenessManager
klet.livenessManager = proberesults.NewManager()
1.3.6. podManager
// podManager is also responsible for keeping secretManager and configMapManager contents up-to-date.
klet.podManager = kubepod.NewBasicPodManager(kubepod.NewBasicMirrorClient(klet.kubeClient), secretManager, configMapManager, checkpointManager)
1.3.7. resourceAnalyzer
klet.resourceAnalyzer = serverstats.NewResourceAnalyzer(klet, kubeCfg.VolumeStatsAggPeriod.Duration)
1.3.8. containerGC
// setup containerGC
containerGC, err := kubecontainer.NewContainerGC(klet.containerRuntime, containerGCPolicy, klet.sourcesReady)
if err != nil {
return nil, err
}
klet.containerGC = containerGC
klet.containerDeletor = newPodContainerDeletor(klet.containerRuntime, integer.IntMax(containerGCPolicy.MaxPerPodContainer, minDeadContainerInPod))
1.3.9. imageManager
// setup imageManager
imageManager, err := images.NewImageGCManager(klet.containerRuntime, klet.StatsProvider, kubeDeps.Recorder, nodeRef, imageGCPolicy, crOptions.PodSandboxImage)
if err != nil {
return nil, fmt.Errorf("failed to initialize image manager: %v", err)
}
klet.imageManager = imageManager
1.3.10. statusManager
klet.statusManager = status.NewManager(klet.kubeClient, klet.podManager, klet)
1.3.11. probeManager
klet.probeManager = prober.NewManager(
klet.statusManager,
klet.livenessManager,
klet.runner,
containerRefManager,
kubeDeps.Recorder)
1.3.12. tokenManager
tokenManager := token.NewManager(kubeDeps.KubeClient)
1.3.13. volumePluginMgr
klet.volumePluginMgr, err =
NewInitializedVolumePluginMgr(klet, secretManager, configMapManager, tokenManager, kubeDeps.VolumePlugins, kubeDeps.DynamicPluginProber)
if err != nil {
return nil, err
}
if klet.enablePluginsWatcher {
klet.pluginWatcher = pluginwatcher.NewWatcher(klet.getPluginsDir())
}
1.3.14. volumeManager
// setup volumeManager
klet.volumeManager = volumemanager.NewVolumeManager(
kubeCfg.EnableControllerAttachDetach,
nodeName,
klet.podManager,
klet.statusManager,
klet.kubeClient,
klet.volumePluginMgr,
klet.containerRuntime,
kubeDeps.Mounter,
klet.getPodsDir(),
kubeDeps.Recorder,
experimentalCheckNodeCapabilitiesBeforeMount,
keepTerminatedPodVolumes)
1.3.15. evictionManager
// setup eviction manager
evictionManager, evictionAdmitHandler := eviction.NewManager(klet.resourceAnalyzer, evictionConfig, killPodNow(klet.podWorkers, kubeDeps.Recorder), klet.imageManager, klet.containerGC, kubeDeps.Recorder, nodeRef, klet.clock)
klet.evictionManager = evictionManager
klet.admitHandlers.AddPodAdmitHandler(evictionAdmitHandler)
1.4. containerRuntime
目前pod所使用的runtime只有docker和remote两种,rkt已经废弃。
if containerRuntime == "rkt" {
glog.Fatalln("rktnetes has been deprecated in favor of rktlet. Please see https://github.com/kubernetes-incubator/rktlet for more information.")
}
当runtime是docker的时候,会执行docker相关操作。
switch containerRuntime {
case kubetypes.DockerContainerRuntime:
// Create and start the CRI shim running as a grpc server.
...
// The unix socket for kubelet <-> dockershim communication.
...
// Create dockerLegacyService when the logging driver is not supported.
...
case kubetypes.RemoteContainerRuntime:
// No-op.
break
default:
return nil, fmt.Errorf("unsupported CRI runtime: %q", containerRuntime)
}
1.4.1. NewDockerService
// Create and start the CRI shim running as a grpc server.
streamingConfig := getStreamingConfig(kubeCfg, kubeDeps, crOptions)
ds, err := dockershim.NewDockerService(kubeDeps.DockerClientConfig, crOptions.PodSandboxImage, streamingConfig,
&pluginSettings, runtimeCgroups, kubeCfg.CgroupDriver, crOptions.DockershimRootDirectory, !crOptions.RedirectContainerStreaming)
if err != nil {
return nil, err
}
if crOptions.RedirectContainerStreaming {
klet.criHandler = ds
}
1.4.2. NewDockerServer
// The unix socket for kubelet <-> dockershim communication.
glog.V(5).Infof("RemoteRuntimeEndpoint: %q, RemoteImageEndpoint: %q",
remoteRuntimeEndpoint,
remoteImageEndpoint)
glog.V(2).Infof("Starting the GRPC server for the docker CRI shim.")
server := dockerremote.NewDockerServer(remoteRuntimeEndpoint, ds)
if err := server.Start(); err != nil {
return nil, err
}
1.4.3. DockerServer.Start
// Start starts the dockershim grpc server.
func (s *DockerServer) Start() error {
// Start the internal service.
if err := s.service.Start(); err != nil {
glog.Errorf("Unable to start docker service")
return err
}
glog.V(2).Infof("Start dockershim grpc server")
l, err := util.CreateListener(s.endpoint)
if err != nil {
return fmt.Errorf("failed to listen on %q: %v", s.endpoint, err)
}
// Create the grpc server and register runtime and image services.
s.server = grpc.NewServer(
grpc.MaxRecvMsgSize(maxMsgSize),
grpc.MaxSendMsgSize(maxMsgSize),
)
runtimeapi.RegisterRuntimeServiceServer(s.server, s.service)
runtimeapi.RegisterImageServiceServer(s.server, s.service)
go func() {
if err := s.server.Serve(l); err != nil {
glog.Fatalf("Failed to serve connections: %v", err)
}
}()
return nil
}
1.5. podWorker
构造podWorkers和workQueue。
klet.workQueue = queue.NewBasicWorkQueue(klet.clock)
klet.podWorkers = newPodWorkers(klet.syncPod, kubeDeps.Recorder, klet.workQueue, klet.resyncInterval, backOffPeriod, klet.podCache)
1.5.1. PodWorkers接口
// PodWorkers is an abstract interface for testability.
type PodWorkers interface {
UpdatePod(options *UpdatePodOptions)
ForgetNonExistingPodWorkers(desiredPods map[types.UID]empty)
ForgetWorker(uid types.UID)
}
podWorker主要用来对pod的相关事件进行处理和同步，包含以下三个方法：UpdatePod、ForgetNonExistingPodWorkers、ForgetWorker。
2. 总结
- NewMainKubelet主要用来构造kubelet结构体，其中kubelet除了包含必要的配置和client(例如：kubeClient、csiClient等)外，最主要的包含各种manager来管理不同的任务。
- 核心的manager有以下几种：
  - oomWatcher：监控pod内存是否发生OOM。
  - podManager：管理pod的生命周期，包括对pod的增删改查操作等。
  - containerGC：对死亡容器进行垃圾回收。
  - imageManager：对容器镜像进行垃圾回收。
  - statusManager：与apiserver同步pod状态，同时也作状态缓存。
  - volumeManager：对pod的volume进行attached/detached/mounted/unmounted操作。
  - evictionManager：保证节点稳定，必要时对pod进行驱逐(例如资源不足的情况下)。
- NewMainKubelet还包含了serviceLister和nodeLister来监听service和node的列表变化。
- kubelet使用到的containerRuntime目前主要是docker，其中rkt已废弃。NewMainKubelet启动了dockershim grpc server来执行docker相关操作。
- 构建了podWorker来对pod相关的更新逻辑进行处理。
11.5.3 - kubelet源码分析(三)之 startKubelet
以下代码分析基于kubernetes v1.12.0版本。
本文主要分析startKubelet,其中主要是kubelet.Run部分,该部分的内容主要是初始化并运行一些manager。对于kubelet所包含的各种manager的执行逻辑和pod的生命周期管理逻辑待后续文章分析。
后续的文章主要会分类分析pkg/kubelet部分的代码实现。
kubelet的pkg代码目录结构:
kubelet
├── apis # 定义一些相关接口
├── cadvisor # cadvisor
├── cm # ContainerManager、cpu manger、cgroup manager
├── config
├── configmap # configmap manager
├── container # Runtime、ImageService
├── dockershim # docker的相关调用
├── eviction # eviction manager
├── images # image manager
├── kubeletconfig
├── kuberuntime # 核心:kubeGenericRuntimeManager、runtime容器的相关操作
├── lifecycle
├── mountpod
├── network # pod dns
├── nodelease
├── nodestatus # MachineInfo、节点相关信息
├── pleg # PodLifecycleEventGenerator
├── pod # 核心:pod manager、mirror pod
├── preemption
├── qos # 资源服务质量,不过暂时内容很少
├── remote # RemoteRuntimeService
├── server
├── stats # StatsProvider
├── status # status manager
├── types # PodUpdate、PodOperation
├── volumemanager # VolumeManager
├── kubelet.go # 核心: SyncHandler、kubelet的大部分操作
├── kubelet_getters.go # 各种get操作,例如获取相关目录:getRootDir、getPodsDir、getPluginsDir
├── kubelet_network.go #
├── kubelet_network_linux.go
├── kubelet_node_status.go # registerWithAPIServer、initialNode、syncNodeStatus
├── kubelet_pods.go # 核心:pod的增删改查等相关操作、podKiller、
├── kubelet_resources.go
├── kubelet_volumes.go # ListVolumesForPod、cleanupOrphanedPodDirs
├── oom_watcher.go # OOMWatcher
├── pod_container_deletor.go
├── pod_workers.go # 核心:PodWorkers、UpdatePodOptions、syncPodOptions、managePodLoop
├── runonce.go # RunOnce
├── runtime.go
...
1. startKubelet
startKubelet函数位于cmd/kubelet/app/server.go，用于启动并运行一个kubelet；运行kubelet的逻辑代码位于pkg/kubelet/kubelet.go。
主要内容如下:
- 运行一个kubelet,执行kubelet中各种manager的相关逻辑。
- 运行kubelet server启动监听服务。
此部分代码位于cmd/kubelet/app/server.go
func startKubelet(k kubelet.Bootstrap, podCfg *config.PodConfig, kubeCfg *kubeletconfiginternal.KubeletConfiguration, kubeDeps *kubelet.Dependencies, enableServer bool) {
// start the kubelet
go wait.Until(func() {
k.Run(podCfg.Updates())
}, 0, wait.NeverStop)
// start the kubelet server
if enableServer {
go k.ListenAndServe(net.ParseIP(kubeCfg.Address), uint(kubeCfg.Port), kubeDeps.TLSOptions, kubeDeps.Auth, kubeCfg.EnableDebuggingHandlers, kubeCfg.EnableContentionProfiling)
}
if kubeCfg.ReadOnlyPort > 0 {
go k.ListenAndServeReadOnly(net.ParseIP(kubeCfg.Address), uint(kubeCfg.ReadOnlyPort))
}
}
2. Kubelet.Run
Kubelet.Run方法主要将NewMainKubelet构造的各种manager运行起来，让各种manager执行相应的功能，大部分manager以常驻goroutine的方式运行。
Kubelet.Run完整代码如下:
此部分代码位于pkg/kubelet/kubelet.go
// Run starts the kubelet reacting to config updates
func (kl *Kubelet) Run(updates <-chan kubetypes.PodUpdate) {
if kl.logServer == nil {
kl.logServer = http.StripPrefix("/logs/", http.FileServer(http.Dir("/var/log/")))
}
if kl.kubeClient == nil {
glog.Warning("No api server defined - no node status update will be sent.")
}
// Start the cloud provider sync manager
if kl.cloudResourceSyncManager != nil {
go kl.cloudResourceSyncManager.Run(wait.NeverStop)
}
if err := kl.initializeModules(); err != nil {
kl.recorder.Eventf(kl.nodeRef, v1.EventTypeWarning, events.KubeletSetupFailed, err.Error())
glog.Fatal(err)
}
// Start volume manager
go kl.volumeManager.Run(kl.sourcesReady, wait.NeverStop)
if kl.kubeClient != nil {
// Start syncing node status immediately, this may set up things the runtime needs to run.
go wait.Until(kl.syncNodeStatus, kl.nodeStatusUpdateFrequency, wait.NeverStop)
go kl.fastStatusUpdateOnce()
// start syncing lease
if utilfeature.DefaultFeatureGate.Enabled(features.NodeLease) {
go kl.nodeLeaseController.Run(wait.NeverStop)
}
}
go wait.Until(kl.updateRuntimeUp, 5*time.Second, wait.NeverStop)
// Start loop to sync iptables util rules
if kl.makeIPTablesUtilChains {
go wait.Until(kl.syncNetworkUtil, 1*time.Minute, wait.NeverStop)
}
// Start a goroutine responsible for killing pods (that are not properly
// handled by pod workers).
go wait.Until(kl.podKiller, 1*time.Second, wait.NeverStop)
// Start component sync loops.
kl.statusManager.Start()
kl.probeManager.Start()
// Start syncing RuntimeClasses if enabled.
if kl.runtimeClassManager != nil {
go kl.runtimeClassManager.Run(wait.NeverStop)
}
// Start the pod lifecycle event generator.
kl.pleg.Start()
kl.syncLoop(updates, kl)
}
以下对Kubelet.Run分段进行分析。
3. initializeModules
initializeModules包含了imageManager、serverCertificateManager、oomWatcher和resourceAnalyzer。
主要流程如下:
- 创建文件系统目录,包括kubelet的root目录、pods的目录、plugins的目录和容器日志目录。
- 启动imageManager、serverCertificateManager、oomWatcher、resourceAnalyzer。
各种manager的说明如下:
- imageManager：负责镜像垃圾回收。
- serverCertificateManager：负责处理证书。
- oomWatcher：监控内存使用，是否发生内存耗尽。
- resourceAnalyzer：监控资源使用情况。
完整代码如下:
此部分代码位于pkg/kubelet/kubelet.go
// initializeModules will initialize internal modules that do not require the container runtime to be up.
// Note that the modules here must not depend on modules that are not initialized here.
func (kl *Kubelet) initializeModules() error {
// Prometheus metrics.
metrics.Register(kl.runtimeCache, collectors.NewVolumeStatsCollector(kl))
// Setup filesystem directories.
if err := kl.setupDataDirs(); err != nil {
return err
}
// If the container logs directory does not exist, create it.
if _, err := os.Stat(ContainerLogsDir); err != nil {
if err := kl.os.MkdirAll(ContainerLogsDir, 0755); err != nil {
glog.Errorf("Failed to create directory %q: %v", ContainerLogsDir, err)
}
}
// Start the image manager.
kl.imageManager.Start()
// Start the certificate manager if it was enabled.
if kl.serverCertificateManager != nil {
kl.serverCertificateManager.Start()
}
// Start out of memory watcher.
if err := kl.oomWatcher.Start(kl.nodeRef); err != nil {
return fmt.Errorf("Failed to start OOM watcher %v", err)
}
// Start resource analyzer
kl.resourceAnalyzer.Start()
return nil
}
3.1. setupDataDirs
initializeModules先创建相关目录。
具体目录如下:
- ContainerLogsDir：目录为/var/log/containers。
- rootDirectory：由参数传入，一般为/var/lib/kubelet。
- PodsDir：目录为{rootDirectory}/pods。
- PluginsDir：目录为{rootDirectory}/plugins。
initializeModules中setupDataDirs的相关代码如下:
// Setup filesystem directories.
if err := kl.setupDataDirs(); err != nil {
return err
}
// If the container logs directory does not exist, create it.
if _, err := os.Stat(ContainerLogsDir); err != nil {
if err := kl.os.MkdirAll(ContainerLogsDir, 0755); err != nil {
glog.Errorf("Failed to create directory %q: %v", ContainerLogsDir, err)
}
}
setupDataDirs代码如下
// setupDataDirs creates:
// 1. the root directory
// 2. the pods directory
// 3. the plugins directory
func (kl *Kubelet) setupDataDirs() error {
kl.rootDirectory = path.Clean(kl.rootDirectory)
if err := os.MkdirAll(kl.getRootDir(), 0750); err != nil {
return fmt.Errorf("error creating root directory: %v", err)
}
if err := kl.mounter.MakeRShared(kl.getRootDir()); err != nil {
return fmt.Errorf("error configuring root directory: %v", err)
}
if err := os.MkdirAll(kl.getPodsDir(), 0750); err != nil {
return fmt.Errorf("error creating pods directory: %v", err)
}
if err := os.MkdirAll(kl.getPluginsDir(), 0750); err != nil {
return fmt.Errorf("error creating plugins directory: %v", err)
}
return nil
}
3.2. manager
initializeModules中的manager如下:
// Start the image manager.
kl.imageManager.Start()
// Start the certificate manager if it was enabled.
if kl.serverCertificateManager != nil {
kl.serverCertificateManager.Start()
}
// Start out of memory watcher.
if err := kl.oomWatcher.Start(kl.nodeRef); err != nil {
return fmt.Errorf("Failed to start OOM watcher %v", err)
}
// Start resource analyzer
kl.resourceAnalyzer.Start()
4. 运行各种manager
4.1. volumeManager
volumeManager主要运行一组异步循环,根据在此节点上安排的pod调整哪些volume需要attached/detached/mounted/unmounted。
// Start volume manager
go kl.volumeManager.Run(kl.sourcesReady, wait.NeverStop)
volumeManager.Run实现代码如下:
func (vm *volumeManager) Run(sourcesReady config.SourcesReady, stopCh <-chan struct{}) {
defer runtime.HandleCrash()
go vm.desiredStateOfWorldPopulator.Run(sourcesReady, stopCh)
glog.V(2).Infof("The desired_state_of_world populator starts")
glog.Infof("Starting Kubelet Volume Manager")
go vm.reconciler.Run(stopCh)
metrics.Register(vm.actualStateOfWorld, vm.desiredStateOfWorld, vm.volumePluginMgr)
<-stopCh
glog.Infof("Shutting down Kubelet Volume Manager")
}
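volumeManager内部由desiredStateOfWorldPopulator维护期望状态，由reconciler周期性对比期望状态与实际状态来决定attach/mount或detach/unmount。下面是这种期望状态/实际状态对比(reconcile)模式的极简示意，volumeState、reconcile等名称为示例自拟，并非volumeManager的实际实现：

```go
package main

import (
	"fmt"
	"time"
)

// volumeState 只保留"期望挂载"与"实际已挂载"两张表，用于演示reconcile模式。
type volumeState struct {
	desired map[string]bool // 期望挂载的volume
	actual  map[string]bool // 实际已挂载的volume
}

// reconcile 每个周期对比期望状态与实际状态：缺少的执行mount，多余的执行unmount。
func (s *volumeState) reconcile() {
	for v := range s.desired {
		if !s.actual[v] {
			fmt.Println("mount", v)
			s.actual[v] = true
		}
	}
	for v := range s.actual {
		if !s.desired[v] {
			fmt.Println("unmount", v)
			delete(s.actual, v)
		}
	}
}

func main() {
	s := &volumeState{
		desired: map[string]bool{"data-vol": true},
		actual:  map[string]bool{"old-vol": true},
	}
	stopCh := make(chan struct{})
	go func() { time.Sleep(300 * time.Millisecond); close(stopCh) }()
	ticker := time.NewTicker(100 * time.Millisecond)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			s.reconcile()
		case <-stopCh:
			return
		}
	}
}
```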
4.2. syncNodeStatus
syncNodeStatus通过goroutine的方式定期执行,它将节点的状态同步给master,必要的时候注册kubelet。
if kl.kubeClient != nil {
// Start syncing node status immediately, this may set up things the runtime needs to run.
go wait.Until(kl.syncNodeStatus, kl.nodeStatusUpdateFrequency, wait.NeverStop)
go kl.fastStatusUpdateOnce()
// start syncing lease
if utilfeature.DefaultFeatureGate.Enabled(features.NodeLease) {
go kl.nodeLeaseController.Run(wait.NeverStop)
}
}
4.3. updateRuntimeUp
updateRuntimeUp调用容器运行时状态回调,在容器运行时首次启动时初始化运行时相关模块,如果状态检查失败则返回错误。 如果状态检查正常,在kubelet runtimeState中更新容器运行时的正常运行时间。
go wait.Until(kl.updateRuntimeUp, 5*time.Second, wait.NeverStop)
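这里的wait.Until来自k8s.io/apimachinery，其行为大致相当于“先执行一次f，之后每隔period再执行，直到stopCh被关闭”。下面是一个只用标准库模拟该行为的简化示意(真实实现还包含panic恢复等细节，此处省略)：

```go
package main

import (
	"fmt"
	"time"
)

// until 是wait.Until行为的简化示意：先调用一次f，之后每隔period调用一次，
// 直到stopCh被关闭。仅为说明用，不是wait.Until的真实源码。
func until(f func(), period time.Duration, stopCh <-chan struct{}) {
	for {
		select {
		case <-stopCh:
			return
		default:
		}
		f()
		select {
		case <-stopCh:
			return
		case <-time.After(period):
		}
	}
}

func main() {
	stopCh := make(chan struct{})
	go until(func() { fmt.Println("updateRuntimeUp tick") }, 100*time.Millisecond, stopCh)
	time.Sleep(350 * time.Millisecond)
	close(stopCh)
	time.Sleep(50 * time.Millisecond)
}
```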
4.4. syncNetworkUtil
通过循环的方式同步iptables的规则,不过当前代码并没有执行任何操作。
// Start loop to sync iptables util rules
if kl.makeIPTablesUtilChains {
go wait.Until(kl.syncNetworkUtil, 1*time.Minute, wait.NeverStop)
}
4.5. podKiller
当pod没有被pod worker正确处理的时候，启动一个goroutine负责杀死pod。
// Start a goroutine responsible for killing pods (that are not properly
// handled by pod workers).
go wait.Until(kl.podKiller, 1*time.Second, wait.NeverStop)
podKiller代码如下:
此部分代码位于pkg/kubelet/kubelet_pods.go
// podKiller launches a goroutine to kill a pod received from the channel if
// another goroutine isn't already in action.
func (kl *Kubelet) podKiller() {
killing := sets.NewString()
// guard for the killing set
lock := sync.Mutex{}
for podPair := range kl.podKillingCh {
runningPod := podPair.RunningPod
apiPod := podPair.APIPod
lock.Lock()
exists := killing.Has(string(runningPod.ID))
if !exists {
killing.Insert(string(runningPod.ID))
}
lock.Unlock()
if !exists {
go func(apiPod *v1.Pod, runningPod *kubecontainer.Pod) {
glog.V(2).Infof("Killing unwanted pod %q", runningPod.Name)
err := kl.killPod(apiPod, runningPod, nil, nil)
if err != nil {
glog.Errorf("Failed killing the pod %q: %v", runningPod.Name, err)
}
lock.Lock()
killing.Delete(string(runningPod.ID))
lock.Unlock()
}(apiPod, runningPod)
}
}
}
4.6. statusManager
statusManager与apiserver同步pod状态，同时也用作状态缓存。
// Start component sync loops.
kl.statusManager.Start()
statusManager.Start的实现代码如下:
func (m *manager) Start() {
// Don't start the status manager if we don't have a client. This will happen
// on the master, where the kubelet is responsible for bootstrapping the pods
// of the master components.
if m.kubeClient == nil {
glog.Infof("Kubernetes client is nil, not starting status manager.")
return
}
glog.Info("Starting to sync pod status with apiserver")
syncTicker := time.Tick(syncPeriod)
// syncPod and syncBatch share the same go routine to avoid sync races.
go wait.Forever(func() {
select {
case syncRequest := <-m.podStatusChannel:
glog.V(5).Infof("Status Manager: syncing pod: %q, with status: (%d, %v) from podStatusChannel",
syncRequest.podUID, syncRequest.status.version, syncRequest.status.status)
m.syncPod(syncRequest.podUID, syncRequest.status)
case <-syncTicker:
m.syncBatch()
}
}, 0)
}
4.7. probeManager
处理容器探针
kl.probeManager.Start()
4.8. runtimeClassManager
// Start syncing RuntimeClasses if enabled.
if kl.runtimeClassManager != nil {
go kl.runtimeClassManager.Run(wait.NeverStop)
}
4.9. PodLifecycleEventGenerator
// Start the pod lifecycle event generator.
kl.pleg.Start()
PodLifecycleEventGenerator是一个pod生命周期事件生成器接口，具体如下：
// PodLifecycleEventGenerator contains functions for generating pod life cycle events.
type PodLifecycleEventGenerator interface {
Start()
Watch() chan *PodLifecycleEvent
Healthy() (bool, error)
}
Start方法具体实现如下：
// Start spawns a goroutine to relist periodically.
func (g *GenericPLEG) Start() {
go wait.Until(g.relist, g.relistPeriod, wait.NeverStop)
}
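GenericPLEG的relist逻辑本文不展开，其基本思路是：周期性获取容器状态快照，与上一次快照对比，并把差异作为生命周期事件写入channel供syncLoop消费。下面是该思路的极简示意，fakePLEG等名称均为示例自拟，并非kubelet的真实实现：

```go
package main

import (
	"fmt"
	"time"
)

// podLifecycleEvent 模拟一条pod生命周期事件。
type podLifecycleEvent struct {
	podID string
	event string // 例如 "ContainerStarted"、"ContainerDied"
}

type fakePLEG struct {
	eventCh  chan podLifecycleEvent
	lastSeen map[string]string        // podID -> 上一次看到的容器状态
	listPods func() map[string]string // 模拟从runtime获取当前状态快照
}

// relist 获取当前快照，与上一次快照对比，把变化作为事件发出。
func (g *fakePLEG) relist() {
	current := g.listPods()
	for id, state := range current {
		if g.lastSeen[id] != state {
			g.eventCh <- podLifecycleEvent{podID: id, event: state}
		}
	}
	g.lastSeen = current
}

// Start 周期性地执行relist，直到stopCh关闭。
func (g *fakePLEG) Start(period time.Duration, stopCh <-chan struct{}) {
	go func() {
		ticker := time.NewTicker(period)
		defer ticker.Stop()
		for {
			select {
			case <-ticker.C:
				g.relist()
			case <-stopCh:
				return
			}
		}
	}()
}

func main() {
	// 用交替的两个快照模拟容器先启动、后退出。
	states := []map[string]string{
		{"pod-a": "ContainerStarted"},
		{"pod-a": "ContainerDied"},
	}
	i := 0
	g := &fakePLEG{
		eventCh:  make(chan podLifecycleEvent, 10),
		lastSeen: map[string]string{},
		listPods: func() map[string]string {
			s := states[i%len(states)]
			i++
			return s
		},
	}
	stopCh := make(chan struct{})
	g.Start(50*time.Millisecond, stopCh)
	for j := 0; j < 2; j++ {
		fmt.Println(<-g.eventCh)
	}
	close(stopCh)
}
```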
4.10. syncLoop
最后调用syncLoop来执行同步变化变更的循环。
kl.syncLoop(updates, kl)
5. syncLoop
syncLoop是处理变化的主循环。它监听来自file、apiserver和http三种channel的变更；对于监听到的任何新变更，会针对期望状态和运行状态执行一次同步；如果没有监听到配置变化，则按照同步频率(sync-frequency)周期性地同步最后已知的期望状态。
// syncLoop is the main loop for processing changes. It watches for changes from
// three channels (file, apiserver, and http) and creates a union of them. For
// any new change seen, will run a sync against desired state and running state. If
// no changes are seen to the configuration, will synchronize the last known desired
// state every sync-frequency seconds. Never returns.
func (kl *Kubelet) syncLoop(updates <-chan kubetypes.PodUpdate, handler SyncHandler) {
glog.Info("Starting kubelet main sync loop.")
// The resyncTicker wakes up kubelet to checks if there are any pod workers
// that need to be sync'd. A one-second period is sufficient because the
// sync interval is defaulted to 10s.
syncTicker := time.NewTicker(time.Second)
defer syncTicker.Stop()
housekeepingTicker := time.NewTicker(housekeepingPeriod)
defer housekeepingTicker.Stop()
plegCh := kl.pleg.Watch()
const (
base = 100 * time.Millisecond
max = 5 * time.Second
factor = 2
)
duration := base
for {
if rs := kl.runtimeState.runtimeErrors(); len(rs) != 0 {
glog.Infof("skipping pod synchronization - %v", rs)
// exponential backoff
time.Sleep(duration)
duration = time.Duration(math.Min(float64(max), factor*float64(duration)))
continue
}
// reset backoff if we have a success
duration = base
kl.syncLoopMonitor.Store(kl.clock.Now())
if !kl.syncLoopIteration(updates, handler, syncTicker.C, housekeepingTicker.C, plegCh) {
break
}
kl.syncLoopMonitor.Store(kl.clock.Now())
}
}
其中调用了syncLoopIteration函数来执行更具体的监控pod变化的循环。syncLoopIteration的代码逻辑待后续单独分析。
6. 总结
6.1. 基本流程
Kubelet.Run主要流程如下:
- 初始化模块，其实就是运行imageManager、serverCertificateManager、oomWatcher、resourceAnalyzer。
- 运行各种manager，大部分以常驻goroutine的方式运行，其中包括volumeManager、statusManager等。
- 执行处理变更的循环函数syncLoop，对pod的生命周期进行管理。
syncLoop:
syncLoop函数对pod的生命周期进行管理。syncLoop调用了syncLoopIteration函数，该函数根据podUpdate的信息，针对不同的操作，由SyncHandler来执行pod的增删改查等生命周期管理，其中的SyncHandler包括HandlePodSyncs和HandlePodCleanups等。该部分逻辑待后续文章具体分析。
6.2. Manager
以下介绍kubelet运行时涉及到的manager的内容。
| manager | 说明 |
|---|---|
| imageManager | 负责镜像垃圾回收 |
| serverCertificateManager | 负责处理证书 |
| oomWatcher | 监控内存使用,是否发生内存耗尽即OOM |
| resourceAnalyzer | 监控资源使用情况 |
| volumeManager | 对pod执行attached/detached/mounted/unmounted操作 |
| statusManager | 使用apiserver同步pods状态; 也用作状态缓存 |
| probeManager | 处理容器探针 |
| runtimeClassManager | 同步RuntimeClasses |
| podKiller | 负责杀死pod |
11.5.4 - kubelet源码分析(四)之 syncLoopIteration
以下代码分析基于kubernetes v1.12.0版本。
本文主要分析kubelet中syncLoopIteration部分。syncLoopIteration通过几种channel来对不同类型的事件进行监听并做增删改查的处理。
1. syncLoop
syncLoop是处理变更的主循环。它监听来自file、apiserver和http三种channel的变更；对于监听到的任何新变更，会针对期望状态和运行状态执行一次同步；如果没有监听到配置变化，则按照同步频率(sync-frequency)周期性地同步最后已知的期望状态。
此部分代码位于pkg/kubelet/kubelet.go
// syncLoop is the main loop for processing changes. It watches for changes from
// three channels (file, apiserver, and http) and creates a union of them. For
// any new change seen, will run a sync against desired state and running state. If
// no changes are seen to the configuration, will synchronize the last known desired
// state every sync-frequency seconds. Never returns.
func (kl *Kubelet) syncLoop(updates <-chan kubetypes.PodUpdate, handler SyncHandler) {
glog.Info("Starting kubelet main sync loop.")
// The resyncTicker wakes up kubelet to checks if there are any pod workers
// that need to be sync'd. A one-second period is sufficient because the
// sync interval is defaulted to 10s.
syncTicker := time.NewTicker(time.Second)
defer syncTicker.Stop()
housekeepingTicker := time.NewTicker(housekeepingPeriod)
defer housekeepingTicker.Stop()
plegCh := kl.pleg.Watch()
const (
base = 100 * time.Millisecond
max = 5 * time.Second
factor = 2
)
duration := base
for {
if rs := kl.runtimeState.runtimeErrors(); len(rs) != 0 {
glog.Infof("skipping pod synchronization - %v", rs)
// exponential backoff
time.Sleep(duration)
duration = time.Duration(math.Min(float64(max), factor*float64(duration)))
continue
}
// reset backoff if we have a success
duration = base
kl.syncLoopMonitor.Store(kl.clock.Now())
if !kl.syncLoopIteration(updates, handler, syncTicker.C, housekeepingTicker.C, plegCh) {
break
}
kl.syncLoopMonitor.Store(kl.clock.Now())
}
}
其中调用了syncLoopIteration函数来执行更具体的监控pod变化的循环。
2. syncLoopIteration
syncLoopIteration主要通过几种channel来对不同类型的事件进行监听并处理。其中包括:configCh、plegCh、syncCh、houseKeepingCh、livenessManager.Updates()。
syncLoopIteration实际执行了pod的操作,此部分设置了几种不同的channel:
- configCh：将配置更改的pod分派给事件类型的相应处理程序回调。
- plegCh：更新runtime缓存，同步pod。
- syncCh：同步所有等待同步的pod。
- houseKeepingCh：触发清理pod。
- livenessManager.Updates()：对失败的pod或者liveness检查失败的pod进行sync操作。
syncLoopIteration部分代码位于pkg/kubelet/kubelet.go
2.1. configCh
configCh将配置更改的pod分派给事件类型的相应处理程序回调,该部分主要通过SyncHandler对pod的不同事件进行增删改查等操作。
func (kl *Kubelet) syncLoopIteration(configCh <-chan kubetypes.PodUpdate, handler SyncHandler,
syncCh <-chan time.Time, housekeepingCh <-chan time.Time, plegCh <-chan *pleg.PodLifecycleEvent) bool {
select {
case u, open := <-configCh:
// Update from a config source; dispatch it to the right handler
// callback.
if !open {
glog.Errorf("Update channel is closed. Exiting the sync loop.")
return false
}
switch u.Op {
case kubetypes.ADD:
glog.V(2).Infof("SyncLoop (ADD, %q): %q", u.Source, format.Pods(u.Pods))
// After restarting, kubelet will get all existing pods through
// ADD as if they are new pods. These pods will then go through the
// admission process and *may* be rejected. This can be resolved
// once we have checkpointing.
handler.HandlePodAdditions(u.Pods)
case kubetypes.UPDATE:
glog.V(2).Infof("SyncLoop (UPDATE, %q): %q", u.Source, format.PodsWithDeletionTimestamps(u.Pods))
handler.HandlePodUpdates(u.Pods)
case kubetypes.REMOVE:
glog.V(2).Infof("SyncLoop (REMOVE, %q): %q", u.Source, format.Pods(u.Pods))
handler.HandlePodRemoves(u.Pods)
case kubetypes.RECONCILE:
glog.V(4).Infof("SyncLoop (RECONCILE, %q): %q", u.Source, format.Pods(u.Pods))
handler.HandlePodReconcile(u.Pods)
case kubetypes.DELETE:
glog.V(2).Infof("SyncLoop (DELETE, %q): %q", u.Source, format.Pods(u.Pods))
// DELETE is treated as a UPDATE because of graceful deletion.
handler.HandlePodUpdates(u.Pods)
case kubetypes.RESTORE:
glog.V(2).Infof("SyncLoop (RESTORE, %q): %q", u.Source, format.Pods(u.Pods))
// These are pods restored from the checkpoint. Treat them as new
// pods.
handler.HandlePodAdditions(u.Pods)
case kubetypes.SET:
// TODO: Do we want to support this?
glog.Errorf("Kubelet does not support snapshot update")
}
...
}
可以看出syncLoopIteration根据podUpdate的值来执行不同的pod操作,具体如下:
- ADD：HandlePodAdditions
- UPDATE：HandlePodUpdates
- REMOVE：HandlePodRemoves
- RECONCILE：HandlePodReconcile
- DELETE：HandlePodUpdates
- RESTORE：HandlePodAdditions
- podsToSync：HandlePodSyncs
其中执行pod的handler操作的是SyncHandler,该类型是一个接口,实现体为kubelet本身,具体见后续分析。
2.2. plegCh
plegCh:更新runtime缓存,同步pod。此处调用了HandlePodSyncs的函数。
case e := <-plegCh:
if isSyncPodWorthy(e) {
// PLEG event for a pod; sync it.
if pod, ok := kl.podManager.GetPodByUID(e.ID); ok {
glog.V(2).Infof("SyncLoop (PLEG): %q, event: %#v", format.Pod(pod), e)
handler.HandlePodSyncs([]*v1.Pod{pod})
} else {
// If the pod no longer exists, ignore the event.
glog.V(4).Infof("SyncLoop (PLEG): ignore irrelevant event: %#v", e)
}
}
if e.Type == pleg.ContainerDied {
if containerID, ok := e.Data.(string); ok {
kl.cleanUpContainersInPod(e.ID, containerID)
}
}
2.3. syncCh
syncCh:同步所有等待同步的pod。此处调用了HandlePodSyncs的函数。
case <-syncCh:
// Sync pods waiting for sync
podsToSync := kl.getPodsToSync()
if len(podsToSync) == 0 {
break
}
glog.V(4).Infof("SyncLoop (SYNC): %d pods; %s", len(podsToSync), format.Pods(podsToSync))
handler.HandlePodSyncs(podsToSync)
2.4. livenessManager.Update
livenessManager.Updates():对失败的pod或者liveness检查失败的pod进行sync操作。此处调用了HandlePodSyncs的函数。
case update := <-kl.livenessManager.Updates():
if update.Result == proberesults.Failure {
// The liveness manager detected a failure; sync the pod.
// We should not use the pod from livenessManager, because it is never updated after
// initialization.
pod, ok := kl.podManager.GetPodByUID(update.PodUID)
if !ok {
// If the pod no longer exists, ignore the update.
glog.V(4).Infof("SyncLoop (container unhealthy): ignore irrelevant update: %#v", update)
break
}
glog.V(1).Infof("SyncLoop (container unhealthy): %q", format.Pod(pod))
handler.HandlePodSyncs([]*v1.Pod{pod})
}
2.5. housekeepingCh
houseKeepingCh:触发清理pod。此处调用了HandlePodCleanups的函数。
case <-housekeepingCh:
if !kl.sourcesReady.AllReady() {
// If the sources aren't ready or volume manager has not yet synced the states,
// skip housekeeping, as we may accidentally delete pods from unready sources.
glog.V(4).Infof("SyncLoop (housekeeping, skipped): sources aren't ready yet.")
} else {
glog.V(4).Infof("SyncLoop (housekeeping)")
if err := handler.HandlePodCleanups(); err != nil {
glog.Errorf("Failed cleaning pods: %v", err)
}
}
3. SyncHandler
SyncHandler是一个定义Pod的不同Handler的接口，具体实现者是kubelet，该接口的方法主要在syncLoopIteration中调用，接口定义如下：
// SyncHandler is an interface implemented by Kubelet, for testability
type SyncHandler interface {
HandlePodAdditions(pods []*v1.Pod)
HandlePodUpdates(pods []*v1.Pod)
HandlePodRemoves(pods []*v1.Pod)
HandlePodReconcile(pods []*v1.Pod)
HandlePodSyncs(pods []*v1.Pod)
HandlePodCleanups() error
}
SyncHandler部分代码位于pkg/kubelet/kubelet.go
3.1. HandlePodAdditions
HandlePodAdditions先根据pod创建时间对pod进行排序,然后遍历pod列表,来执行pod的相关操作。
// HandlePodAdditions is the callback in SyncHandler for pods being added from
// a config source.
func (kl *Kubelet) HandlePodAdditions(pods []*v1.Pod) {
start := kl.clock.Now()
sort.Sort(sliceutils.PodsByCreationTime(pods))
for _, pod := range pods {
...
}
}
将pod添加到pod manager中。
for _, pod := range pods {
// Responsible for checking limits in resolv.conf
if kl.dnsConfigurer != nil && kl.dnsConfigurer.ResolverConfig != "" {
kl.dnsConfigurer.CheckLimitsForResolvConf()
}
existingPods := kl.podManager.GetPods()
// Always add the pod to the pod manager. Kubelet relies on the pod
// manager as the source of truth for the desired state. If a pod does
// not exist in the pod manager, it means that it has been deleted in
// the apiserver and no action (other than cleanup) is required.
kl.podManager.AddPod(pod)
...
}
如果是mirror pod,则对mirror pod进行处理。
if kubepod.IsMirrorPod(pod) {
kl.handleMirrorPod(pod, start)
continue
}
如果当前pod的状态不是Terminated状态,则判断是否接受该pod,如果不接受则将pod状态改为Failed。
if !kl.podIsTerminated(pod) {
// Only go through the admission process if the pod is not
// terminated.
// We failed pods that we rejected, so activePods include all admitted
// pods that are alive.
activePods := kl.filterOutTerminatedPods(existingPods)
// Check if we can admit the pod; if not, reject it.
if ok, reason, message := kl.canAdmitPod(activePods, pod); !ok {
kl.rejectPod(pod, reason, message)
continue
}
}
执行dispatchWork函数,该函数是syncHandler中调用到的核心函数,该函数在pod worker中启动一个异步循环,来分派pod的相关操作。该函数的具体操作待后续分析。
mirrorPod, _ := kl.podManager.GetMirrorPodByPod(pod)
kl.dispatchWork(pod, kubetypes.SyncPodCreate, mirrorPod, start)
最后将pod添加到probe manager中。
kl.probeManager.AddPod(pod)
3.2. HandlePodUpdates
HandlePodUpdates同样遍历pod列表,执行相应的操作。
// HandlePodUpdates is the callback in the SyncHandler interface for pods
// being updated from a config source.
func (kl *Kubelet) HandlePodUpdates(pods []*v1.Pod) {
start := kl.clock.Now()
for _, pod := range pods {
...
}
}
将pod更新到pod manager中。
for _, pod := range pods {
// Responsible for checking limits in resolv.conf
if kl.dnsConfigurer != nil && kl.dnsConfigurer.ResolverConfig != "" {
kl.dnsConfigurer.CheckLimitsForResolvConf()
}
kl.podManager.UpdatePod(pod)
...
}
如果是mirror pod,则对mirror pod进行处理。
if kubepod.IsMirrorPod(pod) {
kl.handleMirrorPod(pod, start)
continue
}
执行dispatchWork函数。
// TODO: Evaluate if we need to validate and reject updates.
mirrorPod, _ := kl.podManager.GetMirrorPodByPod(pod)
kl.dispatchWork(pod, kubetypes.SyncPodUpdate, mirrorPod, start)
3.3. HandlePodRemoves
HandlePodRemoves遍历pod列表。
// HandlePodRemoves is the callback in the SyncHandler interface for pods
// being removed from a config source.
func (kl *Kubelet) HandlePodRemoves(pods []*v1.Pod) {
start := kl.clock.Now()
for _, pod := range pods {
...
}
}
从pod manager中删除pod。
for _, pod := range pods {
kl.podManager.DeletePod(pod)
...
}
如果是mirror pod,则对mirror pod进行处理。
if kubepod.IsMirrorPod(pod) {
kl.handleMirrorPod(pod, start)
continue
}
调用kubelet的deletePod函数来删除pod。
// Deletion is allowed to fail because the periodic cleanup routine
// will trigger deletion again.
if err := kl.deletePod(pod); err != nil {
glog.V(2).Infof("Failed to delete pod %q, err: %v", format.Pod(pod), err)
}
deletePod函数将需要删除的pod加入podKillingCh这个channel中，由podKiller监听这个channel去执行删除任务，实现如下：
// deletePod deletes the pod from the internal state of the kubelet by:
// 1. stopping the associated pod worker asynchronously
// 2. signaling to kill the pod by sending on the podKillingCh channel
//
// deletePod returns an error if not all sources are ready or the pod is not
// found in the runtime cache.
func (kl *Kubelet) deletePod(pod *v1.Pod) error {
if pod == nil {
return fmt.Errorf("deletePod does not allow nil pod")
}
if !kl.sourcesReady.AllReady() {
// If the sources aren't ready, skip deletion, as we may accidentally delete pods
// for sources that haven't reported yet.
return fmt.Errorf("skipping delete because sources aren't ready yet")
}
kl.podWorkers.ForgetWorker(pod.UID)
// Runtime cache may not have been updated to with the pod, but it's okay
// because the periodic cleanup routine will attempt to delete again later.
runningPods, err := kl.runtimeCache.GetPods()
if err != nil {
return fmt.Errorf("error listing containers: %v", err)
}
runningPod := kubecontainer.Pods(runningPods).FindPod("", pod.UID)
if runningPod.IsEmpty() {
return fmt.Errorf("pod not found")
}
podPair := kubecontainer.PodPair{APIPod: pod, RunningPod: &runningPod}
kl.podKillingCh <- &podPair
// TODO: delete the mirror pod here?
// We leave the volume/directory cleanup to the periodic cleanup routine.
return nil
}
从probe manager中移除pod。
kl.probeManager.RemovePod(pod)
3.4. HandlePodReconcile
遍历pod列表。
// HandlePodReconcile is the callback in the SyncHandler interface for pods
// that should be reconciled.
func (kl *Kubelet) HandlePodReconcile(pods []*v1.Pod) {
start := kl.clock.Now()
for _, pod := range pods {
...
}
}
将pod更新到pod manager中。
for _, pod := range pods {
// Update the pod in pod manager, status manager will do periodically reconcile according
// to the pod manager.
kl.podManager.UpdatePod(pod)
...
}
必要时调整pod的Ready状态,执行dispatchWork函数。
// Reconcile Pod "Ready" condition if necessary. Trigger sync pod for reconciliation.
if status.NeedToReconcilePodReadiness(pod) {
mirrorPod, _ := kl.podManager.GetMirrorPodByPod(pod)
kl.dispatchWork(pod, kubetypes.SyncPodSync, mirrorPod, start)
}
如果pod被设定为需要被驱逐的,则删除pod中的容器。
// After an evicted pod is synced, all dead containers in the pod can be removed.
if eviction.PodIsEvicted(pod.Status) {
if podStatus, err := kl.podCache.Get(pod.UID); err == nil {
kl.containerDeletor.deleteContainersInPod("", podStatus, true)
}
}
3.5. HandlePodSyncs
HandlePodSyncs是syncHandler接口回调函数,调用dispatchWork,通过pod worker来执行任务。
// HandlePodSyncs is the callback in the syncHandler interface for pods
// that should be dispatched to pod workers for sync.
func (kl *Kubelet) HandlePodSyncs(pods []*v1.Pod) {
start := kl.clock.Now()
for _, pod := range pods {
mirrorPod, _ := kl.podManager.GetMirrorPodByPod(pod)
kl.dispatchWork(pod, kubetypes.SyncPodSync, mirrorPod, start)
}
}
3.6. HandlePodCleanups
HandlePodCleanups主要用来执行pod的清理任务,其中包括terminating的pod,orphaned的pod等。
首先查看pod使用到的cgroup。
// HandlePodCleanups performs a series of cleanup work, including terminating
// pod workers, killing unwanted pods, and removing orphaned volumes/pod
// directories.
// NOTE: This function is executed by the main sync loop, so it
// should not contain any blocking calls.
func (kl *Kubelet) HandlePodCleanups() error {
// The kubelet lacks checkpointing, so we need to introspect the set of pods
// in the cgroup tree prior to inspecting the set of pods in our pod manager.
// this ensures our view of the cgroup tree does not mistakenly observe pods
// that are added after the fact...
var (
cgroupPods map[types.UID]cm.CgroupName
err error
)
if kl.cgroupsPerQOS {
pcm := kl.containerManager.NewPodContainerManager()
cgroupPods, err = pcm.GetAllPodsFromCgroups()
if err != nil {
return fmt.Errorf("failed to get list of pods that still exist on cgroup mounts: %v", err)
}
}
...
}
列出所有pod包括mirror pod。
allPods, mirrorPods := kl.podManager.GetPodsAndMirrorPods()
// Pod phase progresses monotonically. Once a pod has reached a final state,
// it should never leave regardless of the restart policy. The statuses
// of such pods should not be changed, and there is no need to sync them.
// TODO: the logic here does not handle two cases:
// 1. If the containers were removed immediately after they died, kubelet
// may fail to generate correct statuses, let alone filtering correctly.
// 2. If kubelet restarted before writing the terminated status for a pod
// to the apiserver, it could still restart the terminated pod (even
// though the pod was not considered terminated by the apiserver).
// These two conditions could be alleviated by checkpointing kubelet.
activePods := kl.filterOutTerminatedPods(allPods)
desiredPods := make(map[types.UID]empty)
for _, pod := range activePods {
desiredPods[pod.UID] = empty{}
}
pod worker停止不再存在的pod的任务,并从probe manager中清除pod。
// Stop the workers for no-longer existing pods.
// TODO: is here the best place to forget pod workers?
kl.podWorkers.ForgetNonExistingPodWorkers(desiredPods)
kl.probeManager.CleanupPods(activePods)
将需要杀死的pod加入到podKillingCh的channel中,podKiller的任务会监听该channel并获取需要杀死的pod列表来执行杀死pod的操作。
runningPods, err := kl.runtimeCache.GetPods()
if err != nil {
glog.Errorf("Error listing containers: %#v", err)
return err
}
for _, pod := range runningPods {
if _, found := desiredPods[pod.ID]; !found {
kl.podKillingCh <- &kubecontainer.PodPair{APIPod: nil, RunningPod: pod}
}
}
当pod不再被绑定到该节点,移除podStatus,其中removeOrphanedPodStatuses最后调用的函数是statusManager的RemoveOrphanedStatuses方法。
kl.removeOrphanedPodStatuses(allPods, mirrorPods)
移除所有的orphaned volume。
// Remove any orphaned volumes.
// Note that we pass all pods (including terminated pods) to the function,
// so that we don't remove volumes associated with terminated but not yet
// deleted pods.
err = kl.cleanupOrphanedPodDirs(allPods, runningPods)
if err != nil {
// We want all cleanup tasks to be run even if one of them failed. So
// we just log an error here and continue other cleanup tasks.
// This also applies to the other clean up tasks.
glog.Errorf("Failed cleaning up orphaned pod directories: %v", err)
}
移除mirror pod。
// Remove any orphaned mirror pods.
kl.podManager.DeleteOrphanedMirrorPods()
删除不再运行的pod的cgroup。
// Remove any cgroups in the hierarchy for pods that are no longer running.
if kl.cgroupsPerQOS {
kl.cleanupOrphanedPodCgroups(cgroupPods, activePods)
}
执行垃圾回收(GC)操作。
kl.backOff.GC()
4. dispatchWork
dispatchWork通过pod worker启动一个异步的循环。
完整代码如下:
// dispatchWork starts the asynchronous sync of the pod in a pod worker.
// If the pod is terminated, dispatchWork
func (kl *Kubelet) dispatchWork(pod *v1.Pod, syncType kubetypes.SyncPodType, mirrorPod *v1.Pod, start time.Time) {
if kl.podIsTerminated(pod) {
if pod.DeletionTimestamp != nil {
// If the pod is in a terminated state, there is no pod worker to
// handle the work item. Check if the DeletionTimestamp has been
// set, and force a status update to trigger a pod deletion request
// to the apiserver.
kl.statusManager.TerminatePod(pod)
}
return
}
// Run the sync in an async worker.
kl.podWorkers.UpdatePod(&UpdatePodOptions{
Pod: pod,
MirrorPod: mirrorPod,
UpdateType: syncType,
OnCompleteFunc: func(err error) {
if err != nil {
metrics.PodWorkerLatency.WithLabelValues(syncType.String()).Observe(metrics.SinceInMicroseconds(start))
}
},
})
// Note the number of containers for new pods.
if syncType == kubetypes.SyncPodCreate {
metrics.ContainersPerPodCount.Observe(float64(len(pod.Spec.Containers)))
}
}
以下分段进行分析:
如果pod的状态是处于Terminated状态,则执行statusManager的TerminatePod操作。
// dispatchWork starts the asynchronous sync of the pod in a pod worker.
// If the pod is terminated, dispatchWork
func (kl *Kubelet) dispatchWork(pod *v1.Pod, syncType kubetypes.SyncPodType, mirrorPod *v1.Pod, start time.Time) {
if kl.podIsTerminated(pod) {
if pod.DeletionTimestamp != nil {
// If the pod is in a terminated state, there is no pod worker to
// handle the work item. Check if the DeletionTimestamp has been
// set, and force a status update to trigger a pod deletion request
// to the apiserver.
kl.statusManager.TerminatePod(pod)
}
return
}
...
}
执行pod worker的UpdatePod函数,该函数是pod worker的核心函数,来执行pod相关操作。具体逻辑待下文分析。
// Run the sync in an async worker.
kl.podWorkers.UpdatePod(&UpdatePodOptions{
Pod: pod,
MirrorPod: mirrorPod,
UpdateType: syncType,
OnCompleteFunc: func(err error) {
if err != nil {
metrics.PodWorkerLatency.WithLabelValues(syncType.String()).Observe(metrics.SinceInMicroseconds(start))
}
},
})
当创建类型是SyncPodCreate(即创建pod的时候),统计新pod中容器的数目。
// Note the number of containers for new pods.
if syncType == kubetypes.SyncPodCreate {
metrics.ContainersPerPodCount.Observe(float64(len(pod.Spec.Containers)))
}
5. PodWorkers.UpdatePod
PodWorkers是一个接口类型:
// PodWorkers is an abstract interface for testability.
type PodWorkers interface {
UpdatePod(options *UpdatePodOptions)
ForgetNonExistingPodWorkers(desiredPods map[types.UID]empty)
ForgetWorker(uid types.UID)
}
其中UpdatePod是一个核心方法，通过podUpdates这个channel来传递需要处理的pod信息；对于新出现的pod，每个pod都会由一个goroutine来执行managePodLoop。
此部分代码位于pkg/kubelet/pod_workers.go
// Apply the new setting to the specified pod.
// If the options provide an OnCompleteFunc, the function is invoked if the update is accepted.
// Update requests are ignored if a kill pod request is pending.
func (p *podWorkers) UpdatePod(options *UpdatePodOptions) {
pod := options.Pod
uid := pod.UID
var podUpdates chan UpdatePodOptions
var exists bool
p.podLock.Lock()
defer p.podLock.Unlock()
if podUpdates, exists = p.podUpdates[uid]; !exists {
// We need to have a buffer here, because checkForUpdates() method that
// puts an update into channel is called from the same goroutine where
// the channel is consumed. However, it is guaranteed that in such case
// the channel is empty, so buffer of size 1 is enough.
podUpdates = make(chan UpdatePodOptions, 1)
p.podUpdates[uid] = podUpdates
// Creating a new pod worker either means this is a new pod, or that the
// kubelet just restarted. In either case the kubelet is willing to believe
// the status of the pod for the first pod worker sync. See corresponding
// comment in syncPod.
go func() {
defer runtime.HandleCrash()
p.managePodLoop(podUpdates)
}()
}
if !p.isWorking[pod.UID] {
p.isWorking[pod.UID] = true
podUpdates <- *options
} else {
// if a request to kill a pod is pending, we do not let anything overwrite that request.
update, found := p.lastUndeliveredWorkUpdate[pod.UID]
if !found || update.UpdateType != kubetypes.SyncPodKill {
p.lastUndeliveredWorkUpdate[pod.UID] = *options
}
}
}
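可以把UpdatePod与managePodLoop的配合抽象为“每个pod一个更新channel + 一个常驻goroutine”的工作模式。下面是该模式的简化示意，省略了isWorking、lastUndeliveredWorkUpdate等合并与去重逻辑，workers等名称为示例自拟，并非真实实现：

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// workers 为每个pod维护一个更新channel和一个常驻处理goroutine，仅为示意。
type workers struct {
	mu      sync.Mutex
	updates map[string]chan string // podUID -> 更新channel
	syncFn  func(uid, msg string)  // 对应真实实现中kubelet.syncPod的位置
}

// UpdatePod 第一次见到某个pod时为它创建channel和处理循环，之后只投递更新。
func (w *workers) UpdatePod(uid, msg string) {
	w.mu.Lock()
	defer w.mu.Unlock()
	ch, ok := w.updates[uid]
	if !ok {
		ch = make(chan string, 1) // 与真实实现类似，容量为1的缓冲即可
		w.updates[uid] = ch
		go func() { // 每个pod对应一个常驻的处理循环
			for m := range ch {
				w.syncFn(uid, m)
			}
		}()
	}
	select {
	case ch <- msg:
	default: // 简化处理：channel已满时丢弃；真实实现会记为lastUndeliveredWorkUpdate
	}
}

func main() {
	w := &workers{
		updates: map[string]chan string{},
		syncFn: func(uid, msg string) {
			fmt.Printf("sync pod %s: %s\n", uid, msg)
		},
	}
	w.UpdatePod("pod-a", "create")
	w.UpdatePod("pod-b", "create")
	w.UpdatePod("pod-a", "update")
	time.Sleep(100 * time.Millisecond)
}
```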
6. managePodLoop
managePodLoop通过读取podUpdates channel的信息，执行syncPodFn函数，而syncPodFn函数在newPodWorkers构造时被赋值为kubelet.syncPod。kubelet.syncPod的具体代码逻辑待后续文章单独分析。
// newPodWorkers传入syncPod函数
klet.podWorkers = newPodWorkers(klet.syncPod, kubeDeps.Recorder, klet.workQueue, klet.resyncInterval, backOffPeriod, klet.podCache)
newPodWorkers函数参考:
func newPodWorkers(syncPodFn syncPodFnType, recorder record.EventRecorder, workQueue queue.WorkQueue,
resyncInterval, backOffPeriod time.Duration, podCache kubecontainer.Cache) *podWorkers {
return &podWorkers{
podUpdates: map[types.UID]chan UpdatePodOptions{},
isWorking: map[types.UID]bool{},
lastUndeliveredWorkUpdate: map[types.UID]UpdatePodOptions{},
syncPodFn: syncPodFn, // 构造传入klet.syncPod函数
recorder: recorder,
workQueue: workQueue,
resyncInterval: resyncInterval,
backOffPeriod: backOffPeriod,
podCache: podCache,
}
}
managePodLoop函数参考:
此部分代码位于pkg/kubelet/pod_workers.go
func (p *podWorkers) managePodLoop(podUpdates <-chan UpdatePodOptions) {
var lastSyncTime time.Time
for update := range podUpdates {
err := func() error {
podUID := update.Pod.UID
// This is a blocking call that would return only if the cache
// has an entry for the pod that is newer than minRuntimeCache
// Time. This ensures the worker doesn't start syncing until
// after the cache is at least newer than the finished time of
// the previous sync.
status, err := p.podCache.GetNewerThan(podUID, lastSyncTime)
if err != nil {
// This is the legacy event thrown by manage pod loop
// all other events are now dispatched from syncPodFn
p.recorder.Eventf(update.Pod, v1.EventTypeWarning, events.FailedSync, "error determining status: %v", err)
return err
}
err = p.syncPodFn(syncPodOptions{
mirrorPod: update.MirrorPod,
pod: update.Pod,
podStatus: status,
killPodOptions: update.KillPodOptions,
updateType: update.UpdateType,
})
lastSyncTime = time.Now()
return err
}()
// notify the call-back function if the operation succeeded or not
if update.OnCompleteFunc != nil {
update.OnCompleteFunc(err)
}
if err != nil {
// IMPORTANT: we do not log errors here, the syncPodFn is responsible for logging errors
glog.Errorf("Error syncing pod %s (%q), skipping: %v", update.Pod.UID, format.Pod(update.Pod), err)
}
p.wrapUp(update.Pod.UID, err)
}
}
7. 总结
syncLoopIteration基本流程如下:
- 通过几种channel来对不同类型的事件进行监听并处理。其中channel包括：configCh、plegCh、syncCh、houseKeepingCh、livenessManager.Updates()。
- 不同的SyncHandler执行不同的增删改查操作。
- 其中HandlePodAdditions、HandlePodUpdates、HandlePodReconcile、HandlePodSyncs都调用到了dispatchWork来执行pod的相关操作；HandlePodCleanups的pod清理任务，则通过channel的方式将需要清理的pod交给podKiller来清理。
- dispatchWork调用podWorkers.UpdatePod执行异步操作。
- podWorkers.UpdatePod中调用managePodLoop来执行pod相关操作循环。
channel类型及作用:
- configCh：将配置更改的pod分派给事件类型的相应处理程序回调。
- plegCh：更新runtime缓存，同步pod。
- syncCh：同步所有等待同步的pod。
- houseKeepingCh：触发清理pod。
- livenessManager.Updates()：对失败的pod或者liveness检查失败的pod进行sync操作。

下面给出一个模拟该select事件循环结构的简化示意。
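该示意只保留configCh、syncCh、housekeepingCh三类来源，名称与逻辑均为示例自拟，并非kubelet的真实实现：

```go
package main

import (
	"fmt"
	"time"
)

// loopIteration 用一个select监听多个channel，与syncLoopIteration的结构类似。
// 返回false表示配置channel已关闭，外层循环应当退出。
func loopIteration(configCh <-chan string, syncCh, housekeepingCh <-chan time.Time) bool {
	select {
	case u, open := <-configCh:
		if !open {
			fmt.Println("config channel closed, exiting loop")
			return false
		}
		fmt.Println("handle config update:", u)
	case <-syncCh:
		fmt.Println("sync pods waiting for sync")
	case <-housekeepingCh:
		fmt.Println("run pod cleanups")
	}
	return true
}

func main() {
	configCh := make(chan string, 1)
	syncTicker := time.NewTicker(50 * time.Millisecond)
	defer syncTicker.Stop()
	housekeepingTicker := time.NewTicker(120 * time.Millisecond)
	defer housekeepingTicker.Stop()

	configCh <- "ADD pod nginx"
	go func() {
		time.Sleep(200 * time.Millisecond)
		close(configCh) // 关闭配置channel，结束循环
	}()
	for loopIteration(configCh, syncTicker.C, housekeepingTicker.C) {
	}
}
```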
11.5.5 - kubelet源码分析(五)之 syncPod
以下代码分析基于kubernetes v1.12.0版本。
本文主要分析kubelet中syncPod的部分。
1. managePodLoop
managePodLoop通过读取podUpdates channel的信息，执行syncPodFn函数，而syncPodFn函数在newPodWorkers构造时被赋值为kubelet.syncPod。
managePodLoop完整代码如下:
此部分代码位于pkg/kubelet/pod_workers.go
func (p *podWorkers) managePodLoop(podUpdates <-chan UpdatePodOptions) {
var lastSyncTime time.Time
for update := range podUpdates {
err := func() error {
podUID := update.Pod.UID
// This is a blocking call that would return only if the cache
// has an entry for the pod that is newer than minRuntimeCache
// Time. This ensures the worker doesn't start syncing until
// after the cache is at least newer than the finished time of
// the previous sync.
status, err := p.podCache.GetNewerThan(podUID, lastSyncTime)
if err != nil {
// This is the legacy event thrown by manage pod loop
// all other events are now dispatched from syncPodFn
p.recorder.Eventf(update.Pod, v1.EventTypeWarning, events.FailedSync, "error determining status: %v", err)
return err
}
// 该部分的syncPodFn实际上的实现函数是kubelet.syncPod
err = p.syncPodFn(syncPodOptions{
mirrorPod: update.MirrorPod,
pod: update.Pod,
podStatus: status,
killPodOptions: update.KillPodOptions,
updateType: update.UpdateType,
})
lastSyncTime = time.Now()
return err
}()
// notify the call-back function if the operation succeeded or not
if update.OnCompleteFunc != nil {
update.OnCompleteFunc(err)
}
if err != nil {
// IMPORTANT: we do not log errors here, the syncPodFn is responsible for logging errors
glog.Errorf("Error syncing pod %s (%q), skipping: %v", update.Pod.UID, format.Pod(update.Pod), err)
}
p.wrapUp(update.Pod.UID, err)
}
}
以下分析syncPod相关逻辑。
2. syncPod
syncPod可以理解为对单个pod执行同步任务的事务脚本。其中入参是syncPodOptions，syncPodOptions记录了需要同步的pod的相关信息。具体定义如下：
// syncPodOptions provides the arguments to a SyncPod operation.
type syncPodOptions struct {
// the mirror pod for the pod to sync, if it is a static pod
mirrorPod *v1.Pod
// pod to sync
pod *v1.Pod
// the type of update (create, update, sync)
updateType kubetypes.SyncPodType
// the current status
podStatus *kubecontainer.PodStatus
// if update type is kill, use the specified options to kill the pod.
killPodOptions *KillPodOptions
}
syncPod主要执行以下的工作流:
- 如果是正在创建的pod，则记录pod worker的启动latency。
- 调用generateAPIPodStatus为pod提供v1.PodStatus信息。
- 如果pod是第一次运行，记录pod的启动latency。
- 更新status manager中的pod状态。
- 如果pod不应该被运行则杀死pod。
- 如果pod是一个static pod，并且没有对应的mirror pod，则创建一个mirror pod。
- 如果没有pod的数据目录则给pod创建对应的数据目录。
- 等待volume被attach或mount。
- 获取pod的secret数据。
- 调用container runtime的SyncPod函数，执行相关pod操作。
- 更新pod的ingress和egress的traffic limit。
当以上工作流中出现任何error时，都会直接return error，该工作流会在下一次syncPod时被再次执行。错误信息会被记录到event中，方便debug。
以下对syncPod的执行过程进行分析。
syncPod的代码位于pkg/kubelet/kubelet.go
2.1. SyncPodKill
首先,获取syncPodOptions的pod信息。
func (kl *Kubelet) syncPod(o syncPodOptions) error {
// pull out the required options
pod := o.pod
mirrorPod := o.mirrorPod
podStatus := o.podStatus
updateType := o.updateType
...
}
如果pod是需要被杀死的,则执行killPod,会在指定的宽限期内杀死pod。
// if we want to kill a pod, do it now!
if updateType == kubetypes.SyncPodKill {
killPodOptions := o.killPodOptions
if killPodOptions == nil || killPodOptions.PodStatusFunc == nil {
return fmt.Errorf("kill pod options are required if update type is kill")
}
apiPodStatus := killPodOptions.PodStatusFunc(pod, podStatus)
kl.statusManager.SetPodStatus(pod, apiPodStatus)
// we kill the pod with the specified grace period since this is a termination
if err := kl.killPod(pod, nil, podStatus, killPodOptions.PodTerminationGracePeriodSecondsOverride); err != nil {
kl.recorder.Eventf(pod, v1.EventTypeWarning, events.FailedToKillPod, "error killing pod: %v", err)
// there was an error killing the pod, so we return that error directly
utilruntime.HandleError(err)
return err
}
return nil
}
2.2. SyncPodCreate
如果pod是需要被创建的，则记录pod worker的启动latency，该latency以pod第一次被apiserver看到的时间为基准。
// Latency measurements for the main workflow are relative to the
// first time the pod was seen by the API server.
var firstSeenTime time.Time
if firstSeenTimeStr, ok := pod.Annotations[kubetypes.ConfigFirstSeenAnnotationKey]; ok {
firstSeenTime = kubetypes.ConvertToTimestamp(firstSeenTimeStr).Get()
}
// Record pod worker start latency if being created
// TODO: make pod workers record their own latencies
if updateType == kubetypes.SyncPodCreate {
if !firstSeenTime.IsZero() {
// This is the first time we are syncing the pod. Record the latency
// since kubelet first saw the pod if firstSeenTime is set.
metrics.PodWorkerStartLatency.Observe(metrics.SinceInMicroseconds(firstSeenTime))
} else {
glog.V(3).Infof("First seen time not recorded for pod %q", pod.UID)
}
}
通过pod和pod status生成最终的api pod status并设置pod的IP。
// Generate final API pod status with pod and status manager status
apiPodStatus := kl.generateAPIPodStatus(pod, podStatus)
// The pod IP may be changed in generateAPIPodStatus if the pod is using host network. (See #24576)
// TODO(random-liu): After writing pod spec into container labels, check whether pod is using host network, and
// set pod IP to hostIP directly in runtime.GetPodStatus
podStatus.IP = apiPodStatus.PodIP
记录pod到running状态的时间。
// Record the time it takes for the pod to become running.
existingStatus, ok := kl.statusManager.GetPodStatus(pod.UID)
if !ok || existingStatus.Phase == v1.PodPending && apiPodStatus.Phase == v1.PodRunning &&
!firstSeenTime.IsZero() {
metrics.PodStartLatency.Observe(metrics.SinceInMicroseconds(firstSeenTime))
}
如果pod是不可运行的,则更新pod和container的状态和相应的原因。
runnable := kl.canRunPod(pod)
if !runnable.Admit {
// Pod is not runnable; update the Pod and Container statuses to why.
apiPodStatus.Reason = runnable.Reason
apiPodStatus.Message = runnable.Message
// Waiting containers are not creating.
const waitingReason = "Blocked"
for _, cs := range apiPodStatus.InitContainerStatuses {
if cs.State.Waiting != nil {
cs.State.Waiting.Reason = waitingReason
}
}
for _, cs := range apiPodStatus.ContainerStatuses {
if cs.State.Waiting != nil {
cs.State.Waiting.Reason = waitingReason
}
}
}
并更新status manager中的状态信息,杀死不可运行的pod。
// Update status in the status manager
kl.statusManager.SetPodStatus(pod, apiPodStatus)
// Kill pod if it should not be running
if !runnable.Admit || pod.DeletionTimestamp != nil || apiPodStatus.Phase == v1.PodFailed {
var syncErr error
if err := kl.killPod(pod, nil, podStatus, nil); err != nil {
kl.recorder.Eventf(pod, v1.EventTypeWarning, events.FailedToKillPod, "error killing pod: %v", err)
syncErr = fmt.Errorf("error killing pod: %v", err)
utilruntime.HandleError(syncErr)
} else {
if !runnable.Admit {
// There was no error killing the pod, but the pod cannot be run.
// Return an error to signal that the sync loop should back off.
syncErr = fmt.Errorf("pod cannot be run: %s", runnable.Message)
}
}
return syncErr
}
如果网络插件还没到Ready状态,则只有在使用host网络模式的情况下才启动pod。
// If the network plugin is not ready, only start the pod if it uses the host network
if rs := kl.runtimeState.networkErrors(); len(rs) != 0 && !kubecontainer.IsHostNetworkPod(pod) {
kl.recorder.Eventf(pod, v1.EventTypeWarning, events.NetworkNotReady, "%s: %v", NetworkNotReadyErrorMsg, rs)
return fmt.Errorf("%s: %v", NetworkNotReadyErrorMsg, rs)
}
2.3. Cgroups
给pod创建Cgroups,如果cgroups-per-qos参数开启,则申请相应的资源。对于terminated的pod不需要创建或更新pod的Cgroups。
当重新启动kubelet并且启用cgroups-per-qos时,应该间歇性地终止所有pod的运行容器并在qos cgroup hierarchy下重新启动。
如果pod的cgroup已经存在或者pod第一次运行,不杀死pod中容器。
// Create Cgroups for the pod and apply resource parameters
// to them if cgroups-per-qos flag is enabled.
pcm := kl.containerManager.NewPodContainerManager()
// If pod has already been terminated then we need not create
// or update the pod's cgroup
if !kl.podIsTerminated(pod) {
// When the kubelet is restarted with the cgroups-per-qos
// flag enabled, all the pod's running containers
// should be killed intermittently and brought back up
// under the qos cgroup hierarchy.
// Check if this is the pod's first sync
firstSync := true
for _, containerStatus := range apiPodStatus.ContainerStatuses {
if containerStatus.State.Running != nil {
firstSync = false
break
}
}
// Don't kill containers in pod if pod's cgroups already
// exists or the pod is running for the first time
podKilled := false
if !pcm.Exists(pod) && !firstSync {
if err := kl.killPod(pod, nil, podStatus, nil); err == nil {
podKilled = true
}
}
...
如果pod被杀死并且重启策略是Never,则不创建或更新对应的Cgroups,否则创建和更新pod的Cgroups。
// Create and Update pod's Cgroups
// Don't create cgroups for run once pod if it was killed above
// The current policy is not to restart the run once pods when
// the kubelet is restarted with the new flag as run once pods are
// expected to run only once and if the kubelet is restarted then
// they are not expected to run again.
// We don't create and apply updates to cgroup if its a run once pod and was killed above
if !(podKilled && pod.Spec.RestartPolicy == v1.RestartPolicyNever) {
if !pcm.Exists(pod) {
if err := kl.containerManager.UpdateQOSCgroups(); err != nil {
glog.V(2).Infof("Failed to update QoS cgroups while syncing pod: %v", err)
}
if err := pcm.EnsureExists(pod); err != nil {
kl.recorder.Eventf(pod, v1.EventTypeWarning, events.FailedToCreatePodContainer, "unable to ensure pod container exists: %v", err)
return fmt.Errorf("failed to ensure that the pod: %v cgroups exist and are correctly applied: %v", pod.UID, err)
}
}
}
其中创建Cgroups是通过containerManager的UpdateQOSCgroups来执行。
if err := kl.containerManager.UpdateQOSCgroups(); err != nil {
glog.V(2).Infof("Failed to update QoS cgroups while syncing pod: %v", err)
}
2.4. Mirror Pod
如果pod是一个static pod且没有对应的mirror pod，则创建一个mirror pod；如果已存在的mirror pod已过期(与static pod不一致)，则先删除再重建一个mirror pod。
// Create Mirror Pod for Static Pod if it doesn't already exist
if kubepod.IsStaticPod(pod) {
podFullName := kubecontainer.GetPodFullName(pod)
deleted := false
if mirrorPod != nil {
if mirrorPod.DeletionTimestamp != nil || !kl.podManager.IsMirrorPodOf(mirrorPod, pod) {
// The mirror pod is semantically different from the static pod. Remove
// it. The mirror pod will get recreated later.
glog.Warningf("Deleting mirror pod %q because it is outdated", format.Pod(mirrorPod))
if err := kl.podManager.DeleteMirrorPod(podFullName); err != nil {
glog.Errorf("Failed deleting mirror pod %q: %v", format.Pod(mirrorPod), err)
} else {
deleted = true
}
}
}
if mirrorPod == nil || deleted {
node, err := kl.GetNode()
if err != nil || node.DeletionTimestamp != nil {
glog.V(4).Infof("No need to create a mirror pod, since node %q has been removed from the cluster", kl.nodeName)
} else {
glog.V(4).Infof("Creating a mirror pod for static pod %q", format.Pod(pod))
if err := kl.podManager.CreateMirrorPod(pod); err != nil {
glog.Errorf("Failed creating a mirror pod for %q: %v", format.Pod(pod), err)
}
}
}
}
2.5. makePodDataDirs
给pod创建数据目录。
// Make data directories for the pod
if err := kl.makePodDataDirs(pod); err != nil {
kl.recorder.Eventf(pod, v1.EventTypeWarning, events.FailedToMakePodDataDirectories, "error making pod data directories: %v", err)
glog.Errorf("Unable to make pod data directories for pod %q: %v", format.Pod(pod), err)
return err
}
其中数据目录包括
- PodDir：{kubelet.rootDirectory}/pods/podUID
- PodVolumesDir：{PodDir}/volumes
- PodPluginsDir：{PodDir}/plugins
// makePodDataDirs creates the dirs for the pod datas.
func (kl *Kubelet) makePodDataDirs(pod *v1.Pod) error {
uid := pod.UID
if err := os.MkdirAll(kl.getPodDir(uid), 0750); err != nil && !os.IsExist(err) {
return err
}
if err := os.MkdirAll(kl.getPodVolumesDir(uid), 0750); err != nil && !os.IsExist(err) {
return err
}
if err := os.MkdirAll(kl.getPodPluginsDir(uid), 0750); err != nil && !os.IsExist(err) {
return err
}
return nil
}
2.6. mount volumes
对非terminated状态的pod挂载volume。
// Volume manager will not mount volumes for terminated pods
if !kl.podIsTerminated(pod) {
// Wait for volumes to attach/mount
if err := kl.volumeManager.WaitForAttachAndMount(pod); err != nil {
kl.recorder.Eventf(pod, v1.EventTypeWarning, events.FailedMountVolume, "Unable to mount volumes for pod %q: %v", format.Pod(pod), err)
glog.Errorf("Unable to mount volumes for pod %q: %v; skipping pod", format.Pod(pod), err)
return err
}
}
2.7. PullSecretsForPod
获取pod的secret数据。
// Fetch the pull secrets for the pod
pullSecrets := kl.getPullSecretsForPod(pod)
getPullSecretsForPod具体实现函数如下:
// getPullSecretsForPod inspects the Pod and retrieves the referenced pull
// secrets.
func (kl *Kubelet) getPullSecretsForPod(pod *v1.Pod) []v1.Secret {
pullSecrets := []v1.Secret{}
for _, secretRef := range pod.Spec.ImagePullSecrets {
secret, err := kl.secretManager.GetSecret(pod.Namespace, secretRef.Name)
if err != nil {
glog.Warningf("Unable to retrieve pull secret %s/%s for %s/%s due to %v. The image pull may not succeed.", pod.Namespace, secretRef.Name, pod.Namespace, pod.Name, err)
continue
}
pullSecrets = append(pullSecrets, *secret)
}
return pullSecrets
}
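这里的pullSecrets来自pod spec中的imagePullSecrets字段。下面给出一个带imagePullSecrets的pod定义示意(使用k8s.io/api/core/v1类型;其中secret名称regcred、镜像地址均为假设,secret需事先通过kubectl create secret docker-registry创建):
package main
import (
	"fmt"
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)
func main() {
	// 示意:带imagePullSecrets的pod定义,kubelet的getPullSecretsForPod即遍历该字段获取secret
	pod := &v1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "demo", Namespace: "default"},
		Spec: v1.PodSpec{
			ImagePullSecrets: []v1.LocalObjectReference{
				{Name: "regcred"}, // 假设的secret名称
			},
			Containers: []v1.Container{
				{Name: "app", Image: "registry.example.com/app:v1"}, // 假设的私有镜像
			},
		},
	}
	fmt.Println("pull secrets:", pod.Spec.ImagePullSecrets)
}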
2.8. containerRuntime.SyncPod
调用container runtime的SyncPod函数,执行相关pod操作,由此kubelet.syncPod的操作逻辑转入containerRuntime.SyncPod函数中。
// Call the container runtime's SyncPod callback
result := kl.containerRuntime.SyncPod(pod, apiPodStatus, podStatus, pullSecrets, kl.backOff)
kl.reasonCache.Update(pod.UID, result)
if err := result.Error(); err != nil {
// Do not return error if the only failures were pods in backoff
for _, r := range result.SyncResults {
if r.Error != kubecontainer.ErrCrashLoopBackOff && r.Error != images.ErrImagePullBackOff {
// Do not record an event here, as we keep all event logging for sync pod failures
// local to container runtime so we get better errors
return err
}
}
return nil
}
3. Runtime.SyncPod
SyncPod通过执行sync操作,使运行中的pod达到期望状态。主要执行以下操作:
- 计算sandbox和container的变化。
- 必要的时候杀死pod。
- 杀死所有不需要运行的container。
- 必要时创建sandbox。
- 创建init container。
- 创建正常的container。
Runtime.SyncPod部分代码位于pkg/kubelet/kuberuntime/kuberuntime_manager.go
3.1. computePodActions
计算sandbox和container的变化。
// Step 1: Compute sandbox and container changes.
podContainerChanges := m.computePodActions(pod, podStatus)
glog.V(3).Infof("computePodActions got %+v for pod %q", podContainerChanges, format.Pod(pod))
if podContainerChanges.CreateSandbox {
ref, err := ref.GetReference(legacyscheme.Scheme, pod)
if err != nil {
glog.Errorf("Couldn't make a ref to pod %q: '%v'", format.Pod(pod), err)
}
if podContainerChanges.SandboxID != "" {
m.recorder.Eventf(ref, v1.EventTypeNormal, events.SandboxChanged, "Pod sandbox changed, it will be killed and re-created.")
} else {
glog.V(4).Infof("SyncPod received new pod %q, will create a sandbox for it", format.Pod(pod))
}
}
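后续各步骤都依据computePodActions返回的podContainerChanges(podActions结构)来决定要做的操作,其主要字段大致如下(示意,类型有简化,具体以对应版本的kuberuntime_manager.go为准):
package kuberuntime
import v1 "k8s.io/api/core/v1"
// podActions主要字段示意(有简化,具体以源码为准)
type podActions struct {
	KillPod                  bool          // 是否需要杀掉整个pod(通常伴随sandbox重建)
	CreateSandbox            bool          // 是否需要(重新)创建sandbox
	SandboxID                string        // 当前已存在的sandbox ID
	Attempt                  uint32        // sandbox的创建次数(attempt)
	NextInitContainerToStart *v1.Container // 下一个需要启动的init container
	ContainersToStart        []int         // 需要启动的容器在pod.Spec.Containers中的下标
	ContainersToKill         map[string]containerToKillInfo // 需要杀掉的容器,源码中key为kubecontainer.ContainerID
}
// containerToKillInfo记录待杀容器的名字和原因(示意)
type containerToKillInfo struct {
	name    string
	message string
}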
3.2. killPodWithSyncResult
必要的时候杀死pod。
// Step 2: Kill the pod if the sandbox has changed.
if podContainerChanges.KillPod {
if !podContainerChanges.CreateSandbox {
glog.V(4).Infof("Stopping PodSandbox for %q because all other containers are dead.", format.Pod(pod))
} else {
glog.V(4).Infof("Stopping PodSandbox for %q, will start new one", format.Pod(pod))
}
killResult := m.killPodWithSyncResult(pod, kubecontainer.ConvertPodStatusToRunningPod(m.runtimeName, podStatus), nil)
result.AddPodSyncResult(killResult)
if killResult.Error() != nil {
glog.Errorf("killPodWithSyncResult failed: %v", killResult.Error())
return
}
if podContainerChanges.CreateSandbox {
m.purgeInitContainers(pod, podStatus)
}
}
3.3. killContainer
杀死所有不需要运行的container。
// Step 3: kill any running containers in this pod which are not to keep.
for containerID, containerInfo := range podContainerChanges.ContainersToKill {
glog.V(3).Infof("Killing unwanted container %q(id=%q) for pod %q", containerInfo.name, containerID, format.Pod(pod))
killContainerResult := kubecontainer.NewSyncResult(kubecontainer.KillContainer, containerInfo.name)
result.AddSyncResult(killContainerResult)
if err := m.killContainer(pod, containerID, containerInfo.name, containerInfo.message, nil); err != nil {
killContainerResult.Fail(kubecontainer.ErrKillContainer, err.Error())
glog.Errorf("killContainer %q(id=%q) for pod %q failed: %v", containerInfo.name, containerID, format.Pod(pod), err)
return
}
}
3.4. createPodSandbox
必要时创建sandbox。
// Step 4: Create a sandbox for the pod if necessary.
...
glog.V(4).Infof("Creating sandbox for pod %q", format.Pod(pod))
createSandboxResult := kubecontainer.NewSyncResult(kubecontainer.CreatePodSandbox, format.Pod(pod))
result.AddSyncResult(createSandboxResult)
podSandboxID, msg, err = m.createPodSandbox(pod, podContainerChanges.Attempt)
if err != nil {
createSandboxResult.Fail(kubecontainer.ErrCreatePodSandbox, msg)
glog.Errorf("createPodSandbox for pod %q failed: %v", format.Pod(pod), err)
ref, referr := ref.GetReference(legacyscheme.Scheme, pod)
if referr != nil {
glog.Errorf("Couldn't make a ref to pod %q: '%v'", format.Pod(pod), referr)
}
m.recorder.Eventf(ref, v1.EventTypeWarning, events.FailedCreatePodSandBox, "Failed create pod sandbox: %v", err)
return
}
glog.V(4).Infof("Created PodSandbox %q for pod %q", podSandboxID, format.Pod(pod))
3.5. start init container
创建init container。
// Step 5: start the init container.
if container := podContainerChanges.NextInitContainerToStart; container != nil {
// Start the next init container.
startContainerResult := kubecontainer.NewSyncResult(kubecontainer.StartContainer, container.Name)
result.AddSyncResult(startContainerResult)
isInBackOff, msg, err := m.doBackOff(pod, container, podStatus, backOff)
if isInBackOff {
startContainerResult.Fail(err, msg)
glog.V(4).Infof("Backing Off restarting init container %+v in pod %v", container, format.Pod(pod))
return
}
glog.V(4).Infof("Creating init container %+v in pod %v", container, format.Pod(pod))
if msg, err := m.startContainer(podSandboxID, podSandboxConfig, container, pod, podStatus, pullSecrets, podIP, kubecontainer.ContainerTypeInit); err != nil {
startContainerResult.Fail(err, msg)
utilruntime.HandleError(fmt.Errorf("init container start failed: %v: %s", err, msg))
return
}
// Successfully started the container; clear the entry in the failure
glog.V(4).Infof("Completed init container %q for pod %q", container.Name, format.Pod(pod))
}
3.6. start containers
创建正常的container。
// Step 6: start containers in podContainerChanges.ContainersToStart.
for _, idx := range podContainerChanges.ContainersToStart {
container := &pod.Spec.Containers[idx]
startContainerResult := kubecontainer.NewSyncResult(kubecontainer.StartContainer, container.Name)
result.AddSyncResult(startContainerResult)
isInBackOff, msg, err := m.doBackOff(pod, container, podStatus, backOff)
if isInBackOff {
startContainerResult.Fail(err, msg)
glog.V(4).Infof("Backing Off restarting container %+v in pod %v", container, format.Pod(pod))
continue
}
glog.V(4).Infof("Creating container %+v in pod %v", container, format.Pod(pod))
// 通过startContainer来运行容器
if msg, err := m.startContainer(podSandboxID, podSandboxConfig, container, pod, podStatus, pullSecrets, podIP, kubecontainer.ContainerTypeRegular); err != nil {
startContainerResult.Fail(err, msg)
// known errors that are logged in other places are logged at higher levels here to avoid
// repetitive log spam
switch {
case err == images.ErrImagePullBackOff:
glog.V(3).Infof("container start failed: %v: %s", err, msg)
default:
utilruntime.HandleError(fmt.Errorf("container start failed: %v: %s", err, msg))
}
continue
}
}
4. startContainer
startContainer启动一个容器,失败时返回描述失败原因的message和error。
主要包括以下几个步骤:
- 拉取镜像
- 创建容器
- 启动容器
- 运行post start lifecycle hooks(如果有设置此项)
startContainer完整代码如下:
startContainer部分代码位于pkg/kubelet/kuberuntime/kuberuntime_container.go
// startContainer starts a container and returns a message indicates why it is failed on error.
// It starts the container through the following steps:
// * pull the image
// * create the container
// * start the container
// * run the post start lifecycle hooks (if applicable)
func (m *kubeGenericRuntimeManager) startContainer(podSandboxID string, podSandboxConfig *runtimeapi.PodSandboxConfig, container *v1.Container, pod *v1.Pod, podStatus *kubecontainer.PodStatus, pullSecrets []v1.Secret, podIP string, containerType kubecontainer.ContainerType) (string, error) {
// Step 1: pull the image.
imageRef, msg, err := m.imagePuller.EnsureImageExists(pod, container, pullSecrets)
if err != nil {
m.recordContainerEvent(pod, container, "", v1.EventTypeWarning, events.FailedToCreateContainer, "Error: %v", grpc.ErrorDesc(err))
return msg, err
}
// Step 2: create the container.
ref, err := kubecontainer.GenerateContainerRef(pod, container)
if err != nil {
glog.Errorf("Can't make a ref to pod %q, container %v: %v", format.Pod(pod), container.Name, err)
}
glog.V(4).Infof("Generating ref for container %s: %#v", container.Name, ref)
// For a new container, the RestartCount should be 0
restartCount := 0
containerStatus := podStatus.FindContainerStatusByName(container.Name)
if containerStatus != nil {
restartCount = containerStatus.RestartCount + 1
}
containerConfig, cleanupAction, err := m.generateContainerConfig(container, pod, restartCount, podIP, imageRef, containerType)
if cleanupAction != nil {
defer cleanupAction()
}
if err != nil {
m.recordContainerEvent(pod, container, "", v1.EventTypeWarning, events.FailedToCreateContainer, "Error: %v", grpc.ErrorDesc(err))
return grpc.ErrorDesc(err), ErrCreateContainerConfig
}
containerID, err := m.runtimeService.CreateContainer(podSandboxID, containerConfig, podSandboxConfig)
if err != nil {
m.recordContainerEvent(pod, container, containerID, v1.EventTypeWarning, events.FailedToCreateContainer, "Error: %v", grpc.ErrorDesc(err))
return grpc.ErrorDesc(err), ErrCreateContainer
}
err = m.internalLifecycle.PreStartContainer(pod, container, containerID)
if err != nil {
m.recordContainerEvent(pod, container, containerID, v1.EventTypeWarning, events.FailedToStartContainer, "Internal PreStartContainer hook failed: %v", grpc.ErrorDesc(err))
return grpc.ErrorDesc(err), ErrPreStartHook
}
m.recordContainerEvent(pod, container, containerID, v1.EventTypeNormal, events.CreatedContainer, "Created container")
if ref != nil {
m.containerRefManager.SetRef(kubecontainer.ContainerID{
Type: m.runtimeName,
ID: containerID,
}, ref)
}
// Step 3: start the container.
err = m.runtimeService.StartContainer(containerID)
if err != nil {
m.recordContainerEvent(pod, container, containerID, v1.EventTypeWarning, events.FailedToStartContainer, "Error: %v", grpc.ErrorDesc(err))
return grpc.ErrorDesc(err), kubecontainer.ErrRunContainer
}
m.recordContainerEvent(pod, container, containerID, v1.EventTypeNormal, events.StartedContainer, "Started container")
// Symlink container logs to the legacy container log location for cluster logging
// support.
// TODO(random-liu): Remove this after cluster logging supports CRI container log path.
containerMeta := containerConfig.GetMetadata()
sandboxMeta := podSandboxConfig.GetMetadata()
legacySymlink := legacyLogSymlink(containerID, containerMeta.Name, sandboxMeta.Name,
sandboxMeta.Namespace)
containerLog := filepath.Join(podSandboxConfig.LogDirectory, containerConfig.LogPath)
// only create legacy symlink if containerLog path exists (or the error is not IsNotExist).
// Because if containerLog path does not exist, only dandling legacySymlink is created.
// This dangling legacySymlink is later removed by container gc, so it does not make sense
// to create it in the first place. it happens when journald logging driver is used with docker.
if _, err := m.osInterface.Stat(containerLog); !os.IsNotExist(err) {
if err := m.osInterface.Symlink(containerLog, legacySymlink); err != nil {
glog.Errorf("Failed to create legacy symbolic link %q to container %q log %q: %v",
legacySymlink, containerID, containerLog, err)
}
}
// Step 4: execute the post start hook.
if container.Lifecycle != nil && container.Lifecycle.PostStart != nil {
kubeContainerID := kubecontainer.ContainerID{
Type: m.runtimeName,
ID: containerID,
}
msg, handlerErr := m.runner.Run(kubeContainerID, pod, container, container.Lifecycle.PostStart)
if handlerErr != nil {
m.recordContainerEvent(pod, container, kubeContainerID.ID, v1.EventTypeWarning, events.FailedPostStartHook, msg)
if err := m.killContainer(pod, kubeContainerID, container.Name, "FailedPostStartHook", nil); err != nil {
glog.Errorf("Failed to kill container %q(id=%q) in pod %q: %v, %v",
container.Name, kubeContainerID.String(), format.Pod(pod), ErrPostStartHook, err)
}
return msg, fmt.Errorf("%s: %v", ErrPostStartHook, handlerErr)
}
}
return "", nil
}
以下对startContainer分段分析:
4.1. pull image
通过EnsureImageExists方法拉取指定pod容器的镜像,并返回镜像的imageRef及可能的错误。
// Step 1: pull the image.
imageRef, msg, err := m.imagePuller.EnsureImageExists(pod, container, pullSecrets)
if err != nil {
m.recordContainerEvent(pod, container, "", v1.EventTypeWarning, events.FailedToCreateContainer, "Error: %v", grpc.ErrorDesc(err))
return msg, err
}
4.2. CreateContainer
首先生成container的*v1.ObjectReference对象,该对象包括container的相关信息。
// Step 2: create the container.
ref, err := kubecontainer.GenerateContainerRef(pod, container)
if err != nil {
glog.Errorf("Can't make a ref to pod %q, container %v: %v", format.Pod(pod), container.Name, err)
}
glog.V(4).Infof("Generating ref for container %s: %#v", container.Name, ref)
统计container的重启次数,新的容器默认重启次数为0。
// For a new container, the RestartCount should be 0
restartCount := 0
containerStatus := podStatus.FindContainerStatusByName(container.Name)
if containerStatus != nil {
restartCount = containerStatus.RestartCount + 1
}
生成container的配置。
containerConfig, cleanupAction, err := m.generateContainerConfig(container, pod, restartCount, podIP, imageRef, containerType)
if cleanupAction != nil {
defer cleanupAction()
}
if err != nil {
m.recordContainerEvent(pod, container, "", v1.EventTypeWarning, events.FailedToCreateContainer, "Error: %v", grpc.ErrorDesc(err))
return grpc.ErrorDesc(err), ErrCreateContainerConfig
}
调用runtimeService,执行CreateContainer的操作。
containerID, err := m.runtimeService.CreateContainer(podSandboxID, containerConfig, podSandboxConfig)
if err != nil {
m.recordContainerEvent(pod, container, containerID, v1.EventTypeWarning, events.FailedToCreateContainer, "Error: %v", grpc.ErrorDesc(err))
return grpc.ErrorDesc(err), ErrCreateContainer
}
err = m.internalLifecycle.PreStartContainer(pod, container, containerID)
if err != nil {
m.recordContainerEvent(pod, container, containerID, v1.EventTypeWarning, events.FailedToStartContainer, "Internal PreStartContainer hook failed: %v", grpc.ErrorDesc(err))
return grpc.ErrorDesc(err), ErrPreStartHook
}
m.recordContainerEvent(pod, container, containerID, v1.EventTypeNormal, events.CreatedContainer, "Created container")
if ref != nil {
m.containerRefManager.SetRef(kubecontainer.ContainerID{
Type: m.runtimeName,
ID: containerID,
}, ref)
}
4.3. StartContainer
执行runtimeService的StartContainer方法,来启动容器。
// Step 3: start the container.
err = m.runtimeService.StartContainer(containerID)
if err != nil {
m.recordContainerEvent(pod, container, containerID, v1.EventTypeWarning, events.FailedToStartContainer, "Error: %v", grpc.ErrorDesc(err))
return grpc.ErrorDesc(err), kubecontainer.ErrRunContainer
}
m.recordContainerEvent(pod, container, containerID, v1.EventTypeNormal, events.StartedContainer, "Started container")
// Symlink container logs to the legacy container log location for cluster logging
// support.
// TODO(random-liu): Remove this after cluster logging supports CRI container log path.
containerMeta := containerConfig.GetMetadata()
sandboxMeta := podSandboxConfig.GetMetadata()
legacySymlink := legacyLogSymlink(containerID, containerMeta.Name, sandboxMeta.Name,
sandboxMeta.Namespace)
containerLog := filepath.Join(podSandboxConfig.LogDirectory, containerConfig.LogPath)
// only create legacy symlink if containerLog path exists (or the error is not IsNotExist).
// Because if containerLog path does not exist, only dandling legacySymlink is created.
// This dangling legacySymlink is later removed by container gc, so it does not make sense
// to create it in the first place. it happens when journald logging driver is used with docker.
if _, err := m.osInterface.Stat(containerLog); !os.IsNotExist(err) {
if err := m.osInterface.Symlink(containerLog, legacySymlink); err != nil {
glog.Errorf("Failed to create legacy symbolic link %q to container %q log %q: %v",
legacySymlink, containerID, containerLog, err)
}
}
4.4. execute post start hook
如果有指定Lifecycle.PostStart,则执行PostStart操作,PostStart如果执行失败,则容器会根据重启的规则进行重启。
// Step 4: execute the post start hook.
if container.Lifecycle != nil && container.Lifecycle.PostStart != nil {
kubeContainerID := kubecontainer.ContainerID{
Type: m.runtimeName,
ID: containerID,
}
msg, handlerErr := m.runner.Run(kubeContainerID, pod, container, container.Lifecycle.PostStart)
if handlerErr != nil {
m.recordContainerEvent(pod, container, kubeContainerID.ID, v1.EventTypeWarning, events.FailedPostStartHook, msg)
if err := m.killContainer(pod, kubeContainerID, container.Name, "FailedPostStartHook", nil); err != nil {
glog.Errorf("Failed to kill container %q(id=%q) in pod %q: %v, %v",
container.Name, kubeContainerID.String(), format.Pod(pod), ErrPostStartHook, err)
}
return msg, fmt.Errorf("%s: %v", ErrPostStartHook, handlerErr)
}
}
5. 总结
kubelet的工作是管理pod在Node上的生命周期(包括增删改查)。kubelet通过各种类型的manager异步地各自执行自己的任务,其间使用多种channel来传递状态变化信号,例如比较重要的podUpdates <-chan UpdatePodOptions,用来传递pod的变更事件。
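下面用一个极简的示意代码(并非kubelet源码)说明podWorkers通过per-pod channel把pod更新分发给各自goroutine串行处理的模式:
package main
import (
	"fmt"
	"sync"
	"time"
)
// UpdatePodOptions仅为示意,真实定义在pkg/kubelet/types中
type UpdatePodOptions struct {
	PodUID string
	Op     string // ADD / UPDATE / DELETE ...
}
type podWorkers struct {
	mu         sync.Mutex
	podUpdates map[string]chan UpdatePodOptions // 每个pod一个channel
}
// UpdatePod: 若该pod还没有worker goroutine则创建,并把更新写入其channel
func (p *podWorkers) UpdatePod(opt UpdatePodOptions) {
	p.mu.Lock()
	ch, ok := p.podUpdates[opt.PodUID]
	if !ok {
		ch = make(chan UpdatePodOptions, 1)
		p.podUpdates[opt.PodUID] = ch
		go p.managePodLoop(ch) // 每个pod独立的处理循环
	}
	p.mu.Unlock()
	ch <- opt
}
// managePodLoop: 串行消费某个pod的更新,相当于逐次调用syncPod
func (p *podWorkers) managePodLoop(podUpdates <-chan UpdatePodOptions) {
	for opt := range podUpdates {
		fmt.Printf("syncPod: pod=%s op=%s\n", opt.PodUID, opt.Op)
	}
}
func main() {
	pw := &podWorkers{podUpdates: map[string]chan UpdatePodOptions{}}
	pw.UpdatePod(UpdatePodOptions{PodUID: "pod-1", Op: "ADD"})
	pw.UpdatePod(UpdatePodOptions{PodUID: "pod-1", Op: "UPDATE"})
	time.Sleep(100 * time.Millisecond) // 等待goroutine输出
}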
创建pod的调用逻辑
syncLoopIteration-->kubetypes.ADD-->HandlePodAdditions(u.Pods)-->dispatchWork(pod, kubetypes.SyncPodCreate, mirrorPod, start)-->podWorkers.UpdatePod-->managePodLoop(podUpdates)-->syncPod(o syncPodOptions)-->containerRuntime.SyncPod-->startContainer
参考:
12 - Runtime
12.1 - Runc和Containerd概述
本文主要介绍OCI、CRI、runc、containerd、cri-containerd、dockershim等组件及其相互的调用关系。
1. 概述
各个组件调用关系图如下:

图片来源:https://www.jianshu.com/p/62e71584d1cb
2. OCI(Open Container Initiative)
OCI(Open Container Initiative)是开放的容器标准,旨在为容器运行时和容器镜像制定相关的标准和规范,其中包括
- runtime-spec:定义容器运行时及其生命周期管理的规范,具体参考runtime-spec。
- image-spec:定义容器镜像格式与内容的规范,具体参考image-spec。
实现OCI标准的容器运行时有runc,kata等。
3. RunC
runc(run container)是基于OCI标准实现的一个轻量级容器运行工具,用来创建和运行容器;而containerd作为常驻的守护进程,用来管理通过runc创建的容器的运行状态。即runc负责创建和运行容器,containerd负责管理容器。
runc包含libcontainer,包括对namespace和cgroup的调用操作。
命令参数:
To start a new instance of a container:
# runc run [ -b bundle ] <container-id>
USAGE:
runc [global options] command [command options] [arguments...]
COMMANDS:
checkpoint checkpoint a running container
create create a container
delete delete any resources held by the container often used with detached container
events display container events such as OOM notifications, cpu, memory, and IO usage statistics
exec execute new process inside the container
init initialize the namespaces and launch the process (do not call it outside of runc)
kill kill sends the specified signal (default: SIGTERM) to the container's init process
list lists containers started by runc with the given root
pause pause suspends all processes inside the container
ps ps displays the processes running inside a container
restore restore a container from a previous checkpoint
resume resumes all processes that have been previously paused
run create and run a container
spec create a new specification file
start executes the user defined process in a created container
state output the state of a container
update update container resource constraints
help, h Shows a list of commands or help for one command
4. Containerd
containerd(container daemon)是一个daemon进程用来管理和运行容器,可以用来拉取/推送镜像和管理容器的存储和网络。其中可以调用runc来创建和运行容器。
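除了被kubelet/docker调用之外,containerd也提供Go客户端库供程序直接使用。下面是一个极简示意(假设使用github.com/containerd/containerd客户端库,镜像地址仅作示例):
package main
import (
	"context"
	"fmt"
	"log"
	"github.com/containerd/containerd"
	"github.com/containerd/containerd/namespaces"
)
func main() {
	// 连接本机containerd,默认socket为/run/containerd/containerd.sock
	client, err := containerd.New("/run/containerd/containerd.sock")
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()
	// containerd通过namespace隔离不同调用方的数据,k8s使用"k8s.io",这里用"default"
	ctx := namespaces.WithNamespace(context.Background(), "default")
	// 拉取并解压镜像
	image, err := client.Pull(ctx, "docker.io/library/redis:alpine", containerd.WithPullUnpack)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("pulled image:", image.Name())
}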
4.1. containerd的架构图

4.2. docker与containerd、runc的关系图

更具体的调用逻辑:

5. CRI(Container Runtime Interface )
CRI即容器运行时接口,主要用来定义k8s与容器运行时的API调用,kubelet通过CRI来调用容器运行时,只要实现了CRI接口的容器运行时就可以对接到k8s的kubelet组件。

5.1. docker与k8s调用containerd的关系图

5.2. cri-api
5.2.1. runtime service
// Runtime service defines the public APIs for remote container runtimes
service RuntimeService {
// Version returns the runtime name, runtime version, and runtime API version.
rpc Version(VersionRequest) returns (VersionResponse) {}
// RunPodSandbox creates and starts a pod-level sandbox. Runtimes must ensure
// the sandbox is in the ready state on success.
rpc RunPodSandbox(RunPodSandboxRequest) returns (RunPodSandboxResponse) {}
// StopPodSandbox stops any running process that is part of the sandbox and
// reclaims network resources (e.g., IP addresses) allocated to the sandbox.
// If there are any running containers in the sandbox, they must be forcibly
// terminated.
// This call is idempotent, and must not return an error if all relevant
// resources have already been reclaimed. kubelet will call StopPodSandbox
// at least once before calling RemovePodSandbox. It will also attempt to
// reclaim resources eagerly, as soon as a sandbox is not needed. Hence,
// multiple StopPodSandbox calls are expected.
rpc StopPodSandbox(StopPodSandboxRequest) returns (StopPodSandboxResponse) {}
// RemovePodSandbox removes the sandbox. If there are any running containers
// in the sandbox, they must be forcibly terminated and removed.
// This call is idempotent, and must not return an error if the sandbox has
// already been removed.
rpc RemovePodSandbox(RemovePodSandboxRequest) returns (RemovePodSandboxResponse) {}
// PodSandboxStatus returns the status of the PodSandbox. If the PodSandbox is not
// present, returns an error.
rpc PodSandboxStatus(PodSandboxStatusRequest) returns (PodSandboxStatusResponse) {}
// ListPodSandbox returns a list of PodSandboxes.
rpc ListPodSandbox(ListPodSandboxRequest) returns (ListPodSandboxResponse) {}
// CreateContainer creates a new container in specified PodSandbox
rpc CreateContainer(CreateContainerRequest) returns (CreateContainerResponse) {}
// StartContainer starts the container.
rpc StartContainer(StartContainerRequest) returns (StartContainerResponse) {}
// StopContainer stops a running container with a grace period (i.e., timeout).
// This call is idempotent, and must not return an error if the container has
// already been stopped.
// The runtime must forcibly kill the container after the grace period is
// reached.
rpc StopContainer(StopContainerRequest) returns (StopContainerResponse) {}
// RemoveContainer removes the container. If the container is running, the
// container must be forcibly removed.
// This call is idempotent, and must not return an error if the container has
// already been removed.
rpc RemoveContainer(RemoveContainerRequest) returns (RemoveContainerResponse) {}
// ListContainers lists all containers by filters.
rpc ListContainers(ListContainersRequest) returns (ListContainersResponse) {}
// ContainerStatus returns status of the container. If the container is not
// present, returns an error.
rpc ContainerStatus(ContainerStatusRequest) returns (ContainerStatusResponse) {}
// UpdateContainerResources updates ContainerConfig of the container.
rpc UpdateContainerResources(UpdateContainerResourcesRequest) returns (UpdateContainerResourcesResponse) {}
// ReopenContainerLog asks runtime to reopen the stdout/stderr log file
// for the container. This is often called after the log file has been
// rotated. If the container is not running, container runtime can choose
// to either create a new log file and return nil, or return an error.
// Once it returns error, new container log file MUST NOT be created.
rpc ReopenContainerLog(ReopenContainerLogRequest) returns (ReopenContainerLogResponse) {}
// ExecSync runs a command in a container synchronously.
rpc ExecSync(ExecSyncRequest) returns (ExecSyncResponse) {}
// Exec prepares a streaming endpoint to execute a command in the container.
rpc Exec(ExecRequest) returns (ExecResponse) {}
// Attach prepares a streaming endpoint to attach to a running container.
rpc Attach(AttachRequest) returns (AttachResponse) {}
// PortForward prepares a streaming endpoint to forward ports from a PodSandbox.
rpc PortForward(PortForwardRequest) returns (PortForwardResponse) {}
// ContainerStats returns stats of the container. If the container does not
// exist, the call returns an error.
rpc ContainerStats(ContainerStatsRequest) returns (ContainerStatsResponse) {}
// ListContainerStats returns stats of all running containers.
rpc ListContainerStats(ListContainerStatsRequest) returns (ListContainerStatsResponse) {}
// UpdateRuntimeConfig updates the runtime configuration based on the given request.
rpc UpdateRuntimeConfig(UpdateRuntimeConfigRequest) returns (UpdateRuntimeConfigResponse) {}
// Status returns the status of the runtime.
rpc Status(StatusRequest) returns (StatusResponse) {}
}
5.2.2. image service
// ImageService defines the public APIs for managing images.
service ImageService {
// ListImages lists existing images.
rpc ListImages(ListImagesRequest) returns (ListImagesResponse) {}
// ImageStatus returns the status of the image. If the image is not
// present, returns a response with ImageStatusResponse.Image set to
// nil.
rpc ImageStatus(ImageStatusRequest) returns (ImageStatusResponse) {}
// PullImage pulls an image with authentication config.
rpc PullImage(PullImageRequest) returns (PullImageResponse) {}
// RemoveImage removes the image.
// This call is idempotent, and must not return an error if the image has
// already been removed.
rpc RemoveImage(RemoveImageRequest) returns (RemoveImageResponse) {}
// ImageFSInfo returns information of the filesystem that is used to store images.
rpc ImageFsInfo(ImageFsInfoRequest) returns (ImageFsInfoResponse) {}
}
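只要实现了上述gRPC接口,kubelet或crictl等工具就可以通过unix socket调用对应的容器运行时。下面是一个直接调用CRI RuntimeService的极简示意(假设使用k8s.io/cri-api的v1接口和google.golang.org/grpc,socket路径以containerd为例):
package main
import (
	"context"
	"fmt"
	"log"
	"time"
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	runtimeapi "k8s.io/cri-api/pkg/apis/runtime/v1"
)
func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	// 通过unix socket连接容器运行时(这里以containerd为例)
	conn, err := grpc.DialContext(ctx, "unix:///run/containerd/containerd.sock",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	client := runtimeapi.NewRuntimeServiceClient(conn)
	// 调用Version接口,查看运行时名称及CRI版本
	version, err := client.Version(ctx, &runtimeapi.VersionRequest{})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("runtime: %s %s, CRI: %s\n",
		version.RuntimeName, version.RuntimeVersion, version.RuntimeApiVersion)
	// 列出所有PodSandbox
	sandboxes, err := client.ListPodSandbox(ctx, &runtimeapi.ListPodSandboxRequest{})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("sandbox count:", len(sandboxes.Items))
}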
5.3. cri-containerd

5.3.1. CRI Plugin调用流程
- kubelet调用CRI插件,通过CRI Runtime Service接口创建pod
- cri通过CNI接口创建和配置pod的network namespace
- cri调用containerd创建sandbox container(pause container )并将容器放入pod的cgroup和namespace中
- kubelet调用CRI插件,通过image service接口拉取镜像,接着通过containerd来拉取镜像
- kubelet调用CRI插件,通过runtime service接口运行拉取下来的镜像服务,最后通过containerd来运行业务容器,并将容器放入pod的cgroup和namespace中。
具体参考:https://github.com/containerd/cri/blob/release/1.4/docs/architecture.md
5.3.2. k8s对runtime调用的演进
由原来通过dockershim调用docker再调用containerd,直接变成通过cri-containerd调用containerd,从而减少了一层docker调用逻辑。

具体参考:https://github.com/containerd/cri/blob/release/1.4/docs/proposal.md
5.4. Dockershim
在旧版本的k8s中,由于docker没有实现CRI接口,因此增加一个Dockershim来实现k8s对docker的调用。(shim:垫片,一般用来表示对第三方组件API调用的适配插件,例如k8s使用Dockershim来实现对docker接口的适配调用)
5.5. CRI-O
CRI-O与containerd类似,也是实现了CRI接口的容器运行时,可用来替代containerd管理容器。
参考:
- https://opencontainers.org/about/overview/
- https://github.com/opencontainers/runtime-spec
- https://github.com/kubernetes/kubernetes/blob/242a97307b34076d5d8f5bbeb154fa4d97c9ef1d/docs/devel/container-runtime-interface.md
- https://github.com/containerd/containerd/blob/main/docs/cri/architecture.md
- https://www.tutorialworks.com/difference-docker-containerd-runc-crio-oci/
- https://kubernetes.io/zh/docs/setup/production-environment/container-runtimes/
12.2 - Containerd
12.2.1 - 安装Containerd
1. Ubuntu安装containerd
以下以Ubuntu为例
说明:安装containerd与安装docker流程基本一致,差别在于不需要安装docker-ce
- containerd: apt-get install -y containerd.io
- docker: apt-get install docker-ce docker-ce-cli containerd.io
1. 卸载旧版本
sudo apt-get remove docker docker-engine docker.io containerd runc
如果需要删除镜像及容器数据则执行以下命令
sudo rm -rf /var/lib/docker
sudo rm -rf /var/lib/containerd
2. 准备包环境
1、更新apt,允许使用https。
sudo apt-get update
sudo apt-get install \
ca-certificates \
curl \
gnupg \
lsb-release
2、添加docker官方GPG key。
sudo mkdir -p /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
3、设置软件仓库源
echo \
"deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu \
$(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
3. 安装containerd
# 安装containerd
sudo apt-get update
sudo apt-get install -y containerd.io
# 如果是安装docker则执行:
sudo apt-get install docker-ce docker-ce-cli containerd.io
# 查看运行状态
systemctl enable containerd
systemctl status containerd
安装指定版本
# 查看版本
apt-cache madison containerd.io
# sudo apt-get install containerd.io=<VERSION>
4. 修改配置
在 Linux 上,containerd 的默认 CRI 套接字是 /run/containerd/containerd.sock。
1、生成默认配置
containerd config default > /etc/containerd/config.toml
2、修改CgroupDriver为systemd
k8s官方推荐使用systemd类型的CgroupDriver。
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
...
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
SystemdCgroup = true
3、重启containerd
systemctl restart containerd
2. 离线二进制安装containerd
把containerd、runc、cni-plugins、nerdctl、crictl等二进制下载到本地,再上传到对应服务器,解压文件到对应目录,修改containerd配置文件,启动containerd。
#!/bin/bash
set -e
ContainerdVersion=$1
ContainerdVersion=${ContainerdVersion:-1.6.6}
RuncVersion=$2
RuncVersion=${RuncVersion:-1.1.3}
CniVersion=$3
CniVersion=${CniVersion:-1.1.1}
NerdctlVersion=$4
NerdctlVersion=${NerdctlVersion:-0.21.0}
CrictlVersion=$5
CrictlVersion=${CrictlVersion:-1.24.2}
echo "--------------install containerd--------------"
wget https://github.com/containerd/containerd/releases/download/v${ContainerdVersion}/containerd-${ContainerdVersion}-linux-amd64.tar.gz
tar Cxzvf /usr/local containerd-${ContainerdVersion}-linux-amd64.tar.gz
echo "--------------install containerd service--------------"
wget https://raw.githubusercontent.com/containerd/containerd/681aaf68b7dcbe08a51c3372cbb8f813fb4466e0/containerd.service
mv containerd.service /lib/systemd/system/
mkdir -p /etc/containerd/
containerd config default > /etc/containerd/config.toml
echo "--------------install runc--------------"
wget https://github.com/opencontainers/runc/releases/download/v${RuncVersion}/runc.amd64
chmod +x runc.amd64
mv runc.amd64 /usr/local/bin/runc
echo "--------------install cni plugins--------------"
wget https://github.com/containernetworking/plugins/releases/download/v${CniVersion}/cni-plugins-linux-amd64-v${CniVersion}.tgz
rm -fr /opt/cni/bin
mkdir -p /opt/cni/bin
tar Cxzvf /opt/cni/bin cni-plugins-linux-amd64-v${CniVersion}.tgz
echo "--------------install nerdctl--------------"
wget https://github.com/containerd/nerdctl/releases/download/v${NerdctlVersion}/nerdctl-${NerdctlVersion}-linux-amd64.tar.gz
tar Cxzvf /usr/local/bin nerdctl-${NerdctlVersion}-linux-amd64.tar.gz
echo "--------------install crictl--------------"
wget https://github.com/kubernetes-sigs/cri-tools/releases/download/v${CrictlVersion}/crictl-v${CrictlVersion}-linux-amd64.tar.gz
tar Cxzvf /usr/local/bin crictl-v${CrictlVersion}-linux-amd64.tar.gz
# 启动containerd服务
systemctl daemon-reload
systemctl restart containerd
参考:
12.2.2 -
crictl
#!/bin/bash
CrictlVersion=$1
CrictlVersion=${CrictlVersion:-1.24.2}
echo "--------------install crictl--------------"
wget https://github.com/kubernetes-sigs/cri-tools/releases/download/v${CrictlVersion}/crictl-v${CrictlVersion}-linux-amd64.tar.gz
tar Cxzvf /usr/local/bin crictl-v${CrictlVersion}-linux-amd64.tar.gz
设置配置文件
cat > /etc/crictl.yaml << EOF
runtime-endpoint: unix:///run/containerd/containerd.sock
image-endpoint: unix:///run/containerd/containerd.sock
timeout: 2
debug: false
pull-image-on-create: false
EOF
12.2.3 -
参考:
12.3 - Docker
12.3.2 -
1. 创建Docker Client
Docker是一个client/server架构:通过二进制文件docker创建Docker Client,由其将请求类型与参数发送给Docker Server,再由Docker Server具体执行相应的命令调用。
Docker Client运行流程图如下:

说明:本文分析的代码为Docker 1.2.0版本。
1.1. Docker命令flag参数解析
Docker Server与Docker Client由可执行文件docker命令创建并启动。
- Docker Server的启动:docker -d或docker --daemon=true
- Docker Client的启动:docker --daemon=false ps等
docker参数分为两类:
- 命令行参数(flag参数):--daemon=true,-d
- 实际请求参数:ps, images, pull, push等
/docker/docker.go
func main() {
if reexec.Init() {
return
}
flag.Parse()
// FIXME: validate daemon flags here
......
}
reexec.Init()作用:协调execdriver与容器创建时dockerinit的关系。如果返回值为真则直接退出运行,否则继续执行。判断reexec.Init()之后,调用flag.Parse()解析命令行中的flag参数。
/docker/flag.go
var (
flVersion = flag.Bool([]string{"v", "-version"}, false, "Print version information and quit")
flDaemon = flag.Bool([]string{"d", "-daemon"}, false, "Enable daemon mode")
flDebug = flag.Bool([]string{"D", "-debug"}, false, "Enable debug mode")
flSocketGroup = flag.String([]string{"G", "-group"}, "docker", "Group to assign the unix socket specified by -H when running in daemon mode\nuse '' (the empty string) to disable setting of a group")
flEnableCors = flag.Bool([]string{"#api-enable-cors", "-api-enable-cors"}, false, "Enable CORS headers in the remote API")
flTls = flag.Bool([]string{"-tls"}, false, "Use TLS; implied by tls-verify flags")
flTlsVerify = flag.Bool([]string{"-tlsverify"}, false, "Use TLS and verify the remote (daemon: verify client, client: verify daemon)")
// these are initialized in init() below since their default values depend on dockerCertPath which isn't fully initialized until init() runs
flCa *string
flCert *string
flKey *string
flHosts []string
)
func init() {
flCa = flag.String([]string{"-tlscacert"}, filepath.Join(dockerCertPath, defaultCaFile), "Trust only remotes providing a certificate signed by the CA given here")
flCert = flag.String([]string{"-tlscert"}, filepath.Join(dockerCertPath, defaultCertFile), "Path to TLS certificate file")
flKey = flag.String([]string{"-tlskey"}, filepath.Join(dockerCertPath, defaultKeyFile), "Path to TLS key file")
opts.HostListVar(&flHosts, []string{"H", "-host"}, "The socket(s) to bind to in daemon mode\nspecified using one or more tcp://host:port, unix:///path/to/socket, fd://* or fd://socketfd.")
}
flag.go定义了flag参数,并执行了init的初始化。
Go中的init函数
- 用于程序执行前包的初始化工作,比如初始化变量
- 每个包或源文件可以包含多个init函数
- init函数不能被显式调用,而是在main函数执行前自动被调用
- 不同init函数的执行顺序,按照包导入的顺序执行
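下面是一个最小示例(与docker无关,仅演示init的执行时机与顺序):
package main
import "fmt"
// 包级变量的初始化先于init函数执行
var answer = compute()
func compute() int { return 42 }
// 同一个文件里可以有多个init函数,按出现顺序执行
func init() { fmt.Println("init 1, answer =", answer) }
func init() { fmt.Println("init 2") }
func main() {
	fmt.Println("main") // 输出顺序: init 1 -> init 2 -> main
}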
当解析到第一个非flag参数时,flag解析工作就结束。例如docker --daemon=false --version=false ps
- 完成flag的解析,--daemon=false
- 遇到第一个非flag参数ps,则将ps及其后的参数存入flag.Args(),以便执行之后的具体请求。
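这一行为可以用标准库flag演示(docker实际使用的是自带的mflag包,但“解析到首个非flag参数即停止”的行为一致):
package main
import (
	"flag"
	"fmt"
)
func main() {
	daemon := flag.Bool("daemon", false, "enable daemon mode")
	flag.Parse()
	// 运行: go run main.go --daemon=false ps -a
	// 解析到第一个非flag参数"ps"后即停止,剩余参数进入flag.Args()
	fmt.Println("daemon =", *daemon)    // false
	fmt.Println("args   =", flag.Args()) // [ps -a]
}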
1.2. 处理flag参数并收集Docker Client的配置信息
处理的flag参数有flVersion,flDebug,flDaemon,flTlsVerify以及flTls。
/docker/docker.go
func main() {
......
if len(flHosts) == 0 {
defaultHost := os.Getenv("DOCKER_HOST")
if defaultHost == "" || *flDaemon {
// If we do not have a host, default to unix socket
defaultHost = fmt.Sprintf("unix://%s", api.DEFAULTUNIXSOCKET)
}
if _, err := api.ValidateHost(defaultHost); err != nil {
log.Fatal(err)
}
flHosts = append(flHosts, defaultHost)
}
......
}
flHosts的作用是为Docker Client提供所要连接的host对象,也就是为Docker Server提供所要监听的对象。 当flHosts为空,默认取环境变量DOCKER_HOST,若仍为空或flDaemon为真,则设置为unix socket,值为unix:///var/run/docker.sock。取自/api/common.go中的常量DEFAULTUNIXSOCKET。
/docker/docker.go
func main() {
...
if *flDaemon {
mainDaemon()
return
}
...
}
若flDaemon为真,表示启动Docker Daemon,调用/docker/daemon.go中的func mainDaemon()。
/docker/docker.go
if len(flHosts) > 1 {
log.Fatal("Please specify only one -H")
}
protoAddrParts := strings.SplitN(flHosts[0], "://", 2)
protoAddrParts的作用是解析出Docker Client与Docker Server建立通信的协议与地址,通过strings.SplitN函数分割存储。flHosts[0]的值可以是tcp://0.0.0.0:2375或者unix:///var/run/docker.sock等。
/docker/docker.go
var (
cli *client.DockerCli
tlsConfig tls.Config
)
tlsConfig.InsecureSkipVerify = true
tlsConfig对象的创建是为了保障cli在传输数据的时候遵循安全传输层协议(TLS)。flTlsVerify参数为真,则说明Docker Client需与Docker Server一起验证连接的安全性;如果flTls和flTlsVerify两个参数中有一个为真,则说明需要加载并发送客户端的证书。
/docker/flags.go
flTls = flag.Bool([]string{"-tls"}, false, "Use TLS; implied by tls-verify flags")
flTlsVerify = flag.Bool([]string{"-tlsverify"}, false, "Use TLS and verify the remote (daemon: verify client, client: verify daemon)")
1.3. 如何创建Docker Client
/docker/docker.go
if *flTls || *flTlsVerify {
cli = client.NewDockerCli(os.Stdin, os.Stdout, os.Stderr, protoAddrParts[0], protoAddrParts[1], &tlsConfig)
} else {
cli = client.NewDockerCli(os.Stdin, os.Stdout, os.Stderr, protoAddrParts[0], protoAddrParts[1], nil)
}
在已有配置参数的情况下,通过/api/client/cli.go中的NewDockerCli方法创建Docker Client实例cli。
/api/client/cli.go
type DockerCli struct {
proto string
addr string
configFile *registry.ConfigFile
in io.ReadCloser
out io.Writer
err io.Writer
isTerminal bool
terminalFd uintptr
tlsConfig *tls.Config
scheme string
}
func NewDockerCli(in io.ReadCloser, out, err io.Writer, proto, addr string, tlsConfig *tls.Config) *DockerCli {
var (
isTerminal = false
terminalFd uintptr
scheme = "http"
)
if tlsConfig != nil {
scheme = "https"
}
if in != nil {
if file, ok := out.(*os.File); ok {
terminalFd = file.Fd()
isTerminal = term.IsTerminal(terminalFd)
}
}
if err == nil {
err = out
}
return &DockerCli{
proto: proto,
addr: addr,
in: in,
out: out,
err: err,
isTerminal: isTerminal,
terminalFd: terminalFd,
tlsConfig: tlsConfig,
scheme: scheme,
}
}
2. Docker命令执行
2.1. Docker Client解析请求命令
创建Docker Client之后,docker命令中的请求参数(例如ps,经flag解析后存入flag.Args())会被进一步分析,得出请求参数及请求类型,转换为Docker Server可识别的请求后发送给Docker Server。
/docker/docker.go
if err := cli.Cmd(flag.Args()...); err != nil {
if sterr, ok := err.(*utils.StatusError); ok {
if sterr.Status != "" {
log.Println(sterr.Status)
}
os.Exit(sterr.StatusCode)
}
log.Fatal(err)
}
解析flag.Args()的具体请求参数,执行cli.Cmd函数。代码在/api/client/cli.go
/api/client/cli.go
// Cmd executes the specified command
func (cli *DockerCli) Cmd(args ...string) error {
if len(args) > 0 {
method, exists := cli.getMethod(args[0])
if !exists {
fmt.Println("Error: Command not found:", args[0])
return cli.CmdHelp(args[1:]...)
}
return method(args[1:]...)
}
return cli.CmdHelp(args...)
}
method, exists := cli.getMethod(args[0])获取请求参数,例如docker pull ImageName,args[0]等于pull。
func (cli *DockerCli) getMethod(name string) (func(...string) error, bool) {
if len(name) == 0 {
return nil, false
}
methodName := "Cmd" + strings.ToUpper(name[:1]) + strings.ToLower(name[1:])
method := reflect.ValueOf(cli).MethodByName(methodName)
if !method.IsValid() {
return nil, false
}
return method.Interface().(func(...string) error), true
}
在getMethod中,拼接得到的methodName为“CmdPull”,并通过反射获取对应的方法method。最后执行method(args[1:]...),即CmdPull(args[1:]...)。
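这种“命令名到Cmd前缀方法”的反射分发模式,可以用下面的最小示例来理解(仅为示意,并非docker源码):
package main
import (
	"fmt"
	"reflect"
	"strings"
)
type DockerCli struct{}
func (cli *DockerCli) CmdPull(args ...string) error {
	fmt.Println("pulling:", args)
	return nil
}
func main() {
	cli := &DockerCli{}
	name := "pull"
	// "pull" -> "CmdPull"
	methodName := "Cmd" + strings.ToUpper(name[:1]) + strings.ToLower(name[1:])
	method := reflect.ValueOf(cli).MethodByName(methodName)
	if !method.IsValid() {
		fmt.Println("Error: Command not found:", name)
		return
	}
	// 断言为 func(...string) error 后调用
	fn := method.Interface().(func(...string) error)
	_ = fn("nginx:latest")
}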
2.2. Docker Client执行请求命令
docker pull ImageName中,即执行CmdPull(args[1:]...),args[1:]即为ImageName。命令代码在/api/client/command.go。
/api/client/commands.go
func (cli *DockerCli) CmdPull(args ...string) error {
cmd := cli.Subcmd("pull", "NAME[:TAG]", "Pull an image or a repository from the registry")
tag := cmd.String([]string{"#t", "#-tag"}, "", "Download tagged image in a repository")
if err := cmd.Parse(args); err != nil {
return nil
}
...
}
将args参数进行第二次flag解析,解析过程中先提取是否有符合tag这个flag的参数,若有则赋值给tag变量,其余参数保留在cmd.Args()中(个数为cmd.NArg());若没有tag参数,则所有参数都保留在cmd.Args()中。
/api/client/commands.go
var (
v = url.Values{}
remote = cmd.Arg(0)
)
v.Set("fromImage", remote)
if *tag == "" {
v.Set("tag", *tag)
}
remote, _ = parsers.ParseRepositoryTag(remote)
// Resolve the Repository name from fqn to hostname + name
hostname, _, err := registry.ResolveRepositoryName(remote)
if err != nil {
return err
}
通过remote变量先得到镜像的repository名称,并赋值给remote自身,随后解析改变后的remote,得出镜像所在的host地址,即Docker Registry的地址。若没有指定默认为Docker Hub地址https://index.docker.io/v1/。
/api/client/commands.go
cli.LoadConfigFile()
// Resolve the Auth config relevant for this server
authConfig := cli.configFile.ResolveAuthConfig(hostname)
通过cli对象获取与Docker Server的认证配置信息。
/api/client/commands.go
pull := func(authConfig registry.AuthConfig) error {
buf, err := json.Marshal(authConfig)
if err != nil {
return err
}
registryAuthHeader := []string{
base64.URLEncoding.EncodeToString(buf),
}
return cli.stream("POST", "/images/create?"+v.Encode(), nil, cli.out, map[string][]string{
"X-Registry-Auth": registryAuthHeader,
})
}
定义pull函数:cli.stream("POST", "/images/create?"+v.Encode(),...)向Docker Server发送POST请求,请求url为“"/images/create?"+v.Encode()”,请求的认证信息为:map[string][]string{"X-Registry-Auth": registryAuthHeader,}
/api/client/commands.go
if err := pull(authConfig); err != nil {
if strings.Contains(err.Error(), "Status 401") {
fmt.Fprintln(cli.out, "\nPlease login prior to pull:")
if err := cli.CmdLogin(hostname); err != nil {
return err
}
authConfig := cli.configFile.ResolveAuthConfig(hostname)
return pull(authConfig)
}
return err
}
return nil
调用pull函数,完成下载请求的发送。后续由Docker Server接收到请求后进行具体处理。
参考:
- 《Docker源码分析》
12.3.3 -
1. Docker Daemon架构示意图
Docker Daemon是Docker架构中运行在后台的守护进程,大致可以分为Docker Server、Engine和Job三部分。
Docker Daemon可以认为是通过Docker Server模块接受Docker Client的请求,并在Engine中处理请求,然后根据请求类型,创建出指定的Job并运行。
运行过程的作用有以下几种可能:
- 向Docker Registry获取镜像,
- 通过graphdriver执行容器镜像的本地化操作,
- 通过networkdriver执行容器网络环境的配置,
- 通过execdriver执行容器内部运行的执行工作等。
说明:本文分析的代码为Docker 1.2.0版本。
2. Docker Daemon启动流程图
启动Docker Daemon时,一般可以使用以下命令:docker --daemon=true; docker -d; docker -d=true等。接着由docker的main()函数来解析以上命令的相应flag参数,并最终完成Docker Daemon的启动。
/docker/docker.go
func main() {
...
if *flDaemon {
mainDaemon()
return
}
...
}
3. mainDaemon的具体实现
宏观来讲,mainDaemon()完成创建一个daemon进程,并使其正常运行。
从功能的角度来说,mainDaemon()实现了两部分内容:
- 第一,创建Docker运行环境;
- 第二,服务于Docker Client,接收并处理相应请求。
3.1. 配置初始化
/docker/daemon.go
var (
daemonCfg = &daemon.Config{}
)
func init() {
daemonCfg.InstallFlags()
}
在mainDaemon()运行之前,关于Docker Daemon所需要的config配置信息均已经初始化完毕。
声明一个为daemon包中Config类型的变量,名为daemonCfg。而Config对象,定义了Docker Daemon所需的配置信息。在Docker Daemon在启动时,daemonCfg变量被传递至Docker Daemon并被使用。
/daemon/config.go
type Config struct {
Pidfile string //Docker Daemon所属进程的PID文件
Root string //Docker运行时所使用的root路径
AutoRestart bool //已被弃用,转而支持docker run时的重启策略
Dns []string //Docker使用的DNS Server地址
DnsSearch []string //Docker使用的指定的DNS查找域名
Mirrors []string //指定的优先Docker Registry镜像
EnableIptables bool //启用Docker的iptables功能
EnableIpForward bool //启用net.ipv4.ip_forward功能
EnableIpMasq bool //启用IP伪装技术
DefaultIp net.IP //绑定容器端口时使用的默认IP
BridgeIface string //添加容器网络至已有的网桥
BridgeIP string //创建网桥的IP地址
FixedCIDR string //指定IP的IPv4子网,必须被网桥子网包含
InterContainerCommunication bool //是否允许相同host上容器间的通信
GraphDriver string //Docker运行时使用的特定存储驱动
GraphOptions []string //可设置的存储驱动选项
ExecDriver string // Docker运行时使用的特定exec驱动
Mtu int //设置容器网络的MTU
DisableNetwork bool //有定义,之后未初始化
EnableSelinuxSupport bool //启用SELinux功能的支持
Context map[string][]string //有定义,之后未初始化
}
init()函数实现了daemonCfg变量中各属性的赋值,具体的实现为:daemonCfg.InstallFlags()
/daemon/config.go
// InstallFlags adds command-line options to the top-level flag parser for
// the current process.
// Subsequent calls to `flag.Parse` will populate config with values parsed
// from the command-line.
func (config *Config) InstallFlags() {
flag.StringVar(&config.Pidfile, []string{"p", "-pidfile"}, "/var/run/docker.pid", "Path to use for daemon PID file")
flag.StringVar(&config.Root, []string{"g", "-graph"}, "/var/lib/docker", "Path to use as the root of the Docker runtime")
flag.BoolVar(&config.AutoRestart, []string{"#r", "#-restart"}, true, "--restart on the daemon has been deprecated infavor of --restart policies on docker run")
flag.BoolVar(&config.EnableIptables, []string{"#iptables", "-iptables"}, true, "Enable Docker's addition of iptables rules")
flag.BoolVar(&config.EnableIpForward, []string{"#ip-forward", "-ip-forward"}, true, "Enable net.ipv4.ip_forward")
flag.StringVar(&config.BridgeIP, []string{"#bip", "-bip"}, "", "Use this CIDR notation address for the network bridge's IP, not compatible with -b")
flag.StringVar(&config.BridgeIface, []string{"b", "-bridge"}, "", "Attach containers to a pre-existing network bridge\nuse 'none' to disable container networking")
flag.BoolVar(&config.InterContainerCommunication, []string{"#icc", "-icc"}, true, "Enable inter-container communication")
flag.StringVar(&config.GraphDriver, []string{"s", "-storage-driver"}, "", "Force the Docker runtime to use a specific storage driver")
flag.StringVar(&config.ExecDriver, []string{"e", "-exec-driver"}, "native", "Force the Docker runtime to use a specific exec driver")
flag.BoolVar(&config.EnableSelinuxSupport, []string{"-selinux-enabled"}, false, "Enable selinux support. SELinux does not presently support the BTRFS storage driver")
flag.IntVar(&config.Mtu, []string{"#mtu", "-mtu"}, 0, "Set the containers network MTU\nif no value is provided: default to the default route MTU or 1500 if no default route is available")
opts.IPVar(&config.DefaultIp, []string{"#ip", "-ip"}, "0.0.0.0", "Default IP address to use when binding container ports")
opts.ListVar(&config.GraphOptions, []string{"-storage-opt"}, "Set storage driver options")
// FIXME: why the inconsistency between "hosts" and "sockets"?
opts.IPListVar(&config.Dns, []string{"#dns", "-dns"}, "Force Docker to use specific DNS servers")
opts.DnsSearchListVar(&config.DnsSearch, []string{"-dns-search"}, "Force Docker to use specific DNS search domains")
}
在InstallFlags()函数的实现过程中,主要是定义某种类型的flag参数,并将该参数的值绑定在config变量的指定属性上,如:
flag.StringVar(&config.Pidfile, []string{"p", "-pidfile"}, " /var/run/docker.pid", "Path to use for daemon PID file")
以上语句的含义为:
- 定义一个为String类型的flag参数;
- 该flag的名称为”p”或者”-pidfile”;
- 该flag的默认值为”/var/run/docker.pid”,并将该值绑定在变量config.Pidfile上;
- 该flag的描述信息为"Path to use for daemon PID file"。
3.2. flag参数检查
/docker/daemon.go
if flag.NArg() != 0 {
flag.Usage()
return
}
- 参数个数不为0,则说明在启动Docker Daemon的时候,传入了多余的参数,此时会输出错误提示,并退出运行程序。
- 若为0,则说明Docker Daemon的启动命令无误,正常运行。
3.3. 创建engine对象
/docker/daemon.go
eng := engine.New()
Engine是Docker架构中的运行引擎,同时也是Docker运行的核心模块。Engine扮演着Docker container存储仓库的角色,并且通过job的形式来管理这些容器。
/engine/engine.go
type Engine struct {
handlers map[string]Handler
catchall Handler
hack Hack // data for temporary hackery (see hack.go)
id string
Stdout io.Writer
Stderr io.Writer
Stdin io.Reader
Logging bool
tasks sync.WaitGroup
l sync.RWMutex // lock for shutdown
shutdown bool
onShutdown []func() // shutdown handlers
}
Engine结构体中最为重要的即为handlers属性。该handlers属性为map类型,key为string类型,value为Handler类型。Handler是一个函数类型,传入的参数为Job指针,返回值为Status。
/engine/engine.go
type Handler func(*Job) Status
New()函数的实现:
/engine/engine.go
// New initializes a new engine.
func New() *Engine {
eng := &Engine{
handlers: make(map[string]Handler),
id: utils.RandomString(),
Stdout: os.Stdout,
Stderr: os.Stderr,
Stdin: os.Stdin,
Logging: true,
}
eng.Register("commands", func(job *Job) Status {
for _, name := range eng.commands() {
job.Printf("%s\n", name)
}
return StatusOK
})
// Copy existing global handlers
for k, v := range globalHandlers {
eng.handlers[k] = v
}
return eng
}
- 创建一个Engine结构体实例eng
- 向eng对象注册名为commands的Handler,其中Handler为临时定义的函数func(job *Job) Status{ } , 该函数的作用是通过job来打印所有已经注册完毕的command名称,最终返回状态StatusOK。
- 将已定义的变量globalHandlers中的所有的Handler,都复制到eng对象的handlers属性中。最后成功返回eng对象。
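为便于理解Engine通过handlers映射注册与分发Job的方式,下面给出一个去掉细节的极简示意(并非docker源码):
package main
import "fmt"
type Status int
const StatusOK Status = 0
type Job struct {
	Name    string
	Args    []string
	handler func(*Job) Status
}
func (job *Job) Run() Status { return job.handler(job) }
type Engine struct {
	handlers map[string]func(*Job) Status
}
func New() *Engine {
	return &Engine{handlers: make(map[string]func(*Job) Status)}
}
// Register只是把Handler写入映射,并不会立即执行
func (eng *Engine) Register(name string, h func(*Job) Status) {
	eng.handlers[name] = h
}
// Job根据名字查找已注册的Handler,构造出可运行的Job
func (eng *Engine) Job(name string, args ...string) *Job {
	return &Job{Name: name, Args: args, handler: eng.handlers[name]}
}
func main() {
	eng := New()
	eng.Register("version", func(job *Job) Status {
		fmt.Println("Version: 1.2.0")
		return StatusOK
	})
	eng.Job("version").Run()
}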
3.4. 设置engine的信号捕获
/daemon/daemon.go
signal.Trap(eng.Shutdown)
在Docker Daemon的运行中,设置Trap特定信号的处理方法,特定信号有SIGINT,SIGTERM以及SIGQUIT;当程序捕获到SIGINT或者SIGTERM信号时,执行相应的善后操作,最后保证Docker Daemon程序退出。
/pkg/signal/trap.go
//Trap sets up a simplified signal "trap", appropriate for common
// behavior expected from a vanilla unix command-line tool in general
// (and the Docker engine in particular).
//
// * If SIGINT or SIGTERM are received, `cleanup` is called, then the process is terminated.
// * If SIGINT or SIGTERM are repeated 3 times before cleanup is complete, then cleanup is
// skipped and the process terminated directly.
// * If "DEBUG" is set in the environment, SIGQUIT causes an exit without cleanup.
//
func Trap(cleanup func()) {
c := make(chan os.Signal, 1)
signals := []os.Signal{os.Interrupt, syscall.SIGTERM}
if os.Getenv("DEBUG") == "" {
signals = append(signals, syscall.SIGQUIT)
}
gosignal.Notify(c, signals...)
go func() {
interruptCount := uint32(0)
for sig := range c {
go func(sig os.Signal) {
log.Printf("Received signal '%v', starting shutdown of docker.../n", sig)
switch sig {
case os.Interrupt, syscall.SIGTERM:
// If the user really wants to interrupt, let him do so.
if atomic.LoadUint32(&interruptCount) < 3 {
atomic.AddUint32(&interruptCount, 1)
// Initiate the cleanup only once
if atomic.LoadUint32(&interruptCount) == 1 {
// Call cleanup handler
cleanup()
os.Exit(0)
} else {
return
}
} else {
log.Printf("Force shutdown of docker, interrupting cleanup/n")
}
case syscall.SIGQUIT:
}
os.Exit(128 + int(sig.(syscall.Signal)))
}(sig)
}
}()
}
- 创建并设置一个channel,用于发送信号通知;
- 定义signals数组变量,初始值为os.Interrupt和syscall.SIGTERM;若环境变量DEBUG为空,则再添加syscall.SIGQUIT至signals数组;
- 通过gosignal.Notify(c, signals...)中Notify函数来实现将接收到的signal信号传递给c。需要注意的是只有signals中被罗列出的信号才会被传递给c,其余信号会被直接忽略;
- 创建一个goroutine来处理具体的signal信号,当信号类型为os.Interrupt或者syscall.SIGTERM时,执行传入Trap函数的具体执行方法,形参为cleanup(),实参为eng.Shutdown。
Shutdown()函数的定义位于./docker/engine/engine.go,主要做的工作是为Docker Daemon的关闭做一些善后工作。
/engine/engine.go
// Shutdown permanently shuts down eng as follows:
// - It refuses all new jobs, permanently.
// - It waits for all active jobs to complete (with no timeout)
// - It calls all shutdown handlers concurrently (if any)
// - It returns when all handlers complete, or after 15 seconds,
// whichever happens first.
func (eng *Engine) Shutdown() {
eng.l.Lock()
if eng.shutdown {
eng.l.Unlock()
return
}
eng.shutdown = true
eng.l.Unlock()
// We don't need to protect the rest with a lock, to allow
// for other calls to immediately fail with "shutdown" instead
// of hanging for 15 seconds.
// This requires all concurrent calls to check for shutdown, otherwise
// it might cause a race.
// Wait for all jobs to complete.
// Timeout after 5 seconds.
tasksDone := make(chan struct{})
go func() {
eng.tasks.Wait()
close(tasksDone)
}()
select {
case <-time.After(time.Second * 5):
case <-tasksDone:
}
// Call shutdown handlers, if any.
// Timeout after 10 seconds.
var wg sync.WaitGroup
for _, h := range eng.onShutdown {
wg.Add(1)
go func(h func()) {
defer wg.Done()
h()
}(h)
}
done := make(chan struct{})
go func() {
wg.Wait()
close(done)
}()
select {
case <-time.After(time.Second * 10):
case <-done:
}
return
}
- Docker Daemon不再接收任何新的Job;
- Docker Daemon等待所有存活的Job执行完毕;
- Docker Daemon调用所有shutdown的处理方法;
- 当所有的handler执行完毕,或者15秒之后,Shutdown()函数返回。
由于在signal.Trap( eng.Shutdown )函数的具体实现中执行eng.Shutdown,在执行完eng.Shutdown之后,随即执行os.Exit(0),完成当前程序的立即退出。
3.5. 加载builtins
/docker/daemon.go
if err := builtins.Register(eng); err != nil {
log.Fatal(err)
}
为engine注册多个Handler,以便后续在执行相应任务时,运行指定的Handler。
这些Handler包括:
- 网络初始化、
- web API服务、
- 事件查询、
- 版本查看、
- Docker Registry验证与搜索。
/builtins/builtins.go
func Register(eng *engine.Engine) error {
if err := daemon(eng); err != nil {
return err
}
if err := remote(eng); err != nil {
return err
}
if err := events.New().Install(eng); err != nil {
return err
}
if err := eng.Register("version", dockerVersion); err != nil {
return err
}
return registry.NewService().Install(eng)
}
3.5.1. 注册初始化网络驱动的Handler
daemon(eng)的实现过程,主要为eng对象注册了一个key为”init_networkdriver”的Handler,该Handler的值为bridge.InitDriver函数,代码如下:
/builtins/builtins.go
func daemon(eng *engine.Engine) error {
return eng.Register("init_networkdriver", bridge.InitDriver)
}
需要注意的是,向eng对象注册Handler,并不代表Handler的值函数会被直接运行,如bridge.InitDriver,并不会直接运行,而是将bridge.InitDriver的函数入口,写入eng的handlers属性中。
/daemon/networkdriver/bridge/driver.go
func InitDriver(job *engine.Job) engine.Status {
var (
network *net.IPNet
enableIPTables = job.GetenvBool("EnableIptables")
icc = job.GetenvBool("InterContainerCommunication")
ipForward = job.GetenvBool("EnableIpForward")
bridgeIP = job.Getenv("BridgeIP")
)
if defaultIP := job.Getenv("DefaultBindingIP"); defaultIP != "" {
defaultBindingIP = net.ParseIP(defaultIP)
}
bridgeIface = job.Getenv("BridgeIface")
usingDefaultBridge := false
if bridgeIface == "" {
usingDefaultBridge = true
bridgeIface = DefaultNetworkBridge
}
addr, err := networkdriver.GetIfaceAddr(bridgeIface)
if err != nil {
// If we're not using the default bridge, fail without trying to create it
if !usingDefaultBridge {
job.Logf("bridge not found: %s", bridgeIface)
return job.Error(err)
}
// If the iface is not found, try to create it
job.Logf("creating new bridge for %s", bridgeIface)
if err := createBridge(bridgeIP); err != nil {
return job.Error(err)
}
job.Logf("getting iface addr")
addr, err = networkdriver.GetIfaceAddr(bridgeIface)
if err != nil {
return job.Error(err)
}
network = addr.(*net.IPNet)
} else {
network = addr.(*net.IPNet)
// validate that the bridge ip matches the ip specified by BridgeIP
if bridgeIP != "" {
bip, _, err := net.ParseCIDR(bridgeIP)
if err != nil {
return job.Error(err)
}
if !network.IP.Equal(bip) {
return job.Errorf("bridge ip (%s) does not match existing bridge configuration %s", network.IP, bip)
}
}
}
// Configure iptables for link support
if enableIPTables {
if err := setupIPTables(addr, icc); err != nil {
return job.Error(err)
}
}
if ipForward {
// Enable IPv4 forwarding
if err := ioutil.WriteFile("/proc/sys/net/ipv4/ip_forward", []byte{'1', '\n'}, 0644); err != nil {
job.Logf("WARNING: unable to enable IPv4 forwarding: %s\n", err)
}
}
// We can always try removing the iptables
if err := iptables.RemoveExistingChain("DOCKER"); err != nil {
return job.Error(err)
}
if enableIPTables {
chain, err := iptables.NewChain("DOCKER", bridgeIface)
if err != nil {
return job.Error(err)
}
portmapper.SetIptablesChain(chain)
}
bridgeNetwork = network
// https://github.com/docker/docker/issues/2768
job.Eng.Hack_SetGlobalVar("httpapi.bridgeIP", bridgeNetwork.IP)
for name, f := range map[string]engine.Handler{
"allocate_interface": Allocate,
"release_interface": Release,
"allocate_port": AllocatePort,
"link": LinkContainers,
} {
if err := job.Eng.Register(name, f); err != nil {
return job.Error(err)
}
}
return engine.StatusOK
}
Bridge.InitDriver的作用:
- 获取为Docker服务的网络设备的地址;
- 创建指定IP地址的网桥;
- 配置网络iptables规则;
- 另外还为eng对象注册了多个Handler,如 ”allocate_interface”, ”release_interface”, ”allocate_port”,”link”。
3.5.2. 注册API服务的Handler
remote(eng)的实现过程,主要为eng对象注册了两个Handler,分别为”serveapi”与”acceptconnections”。代码实现如下:
/builtins/builtins.go
func remote(eng *engine.Engine) error {
if err := eng.Register("serveapi", apiserver.ServeApi); err != nil {
return err
}
return eng.Register("acceptconnections", apiserver.AcceptConnections)
}
注册的两个Handler名称分别为”serveapi”与”acceptconnections”
- ServeApi执行时,通过循环多种协议,创建出goroutine来配置指定的http.Server,最终为不同的协议请求服务;
- AcceptConnections的实现主要是为了通知init守护进程,Docker Daemon已经启动完毕,可以让Docker Daemon进程接受请求。(守护进程)
3.5.3. 注册events事件的Handler
events.New().Install(eng)的实现过程,为Docker注册了多个event事件,功能是给Docker用户提供API,使得用户可以通过这些API查看Docker内部的events信息,log信息以及subscribers_count信息。
/events/events.go
type Events struct {
mu sync.RWMutex
events []*utils.JSONMessage
subscribers []listener
}
func New() *Events {
return &Events{
events: make([]*utils.JSONMessage, 0, eventsLimit),
}
}
// Install installs events public api in docker engine
func (e *Events) Install(eng *engine.Engine) error {
// Here you should describe public interface
jobs := map[string]engine.Handler{
"events": e.Get,
"log": e.Log,
"subscribers_count": e.SubscribersCount,
}
for name, job := range jobs {
if err := eng.Register(name, job); err != nil {
return err
}
}
return nil
}
3.5.4. 注册版本的Handler
eng.Register(“version”,dockerVersion)的实现过程,向eng对象注册key为”version”,value为”dockerVersion”执行方法的Handler,dockerVersion的执行过程中,会向名为version的job的标准输出中写入Docker的版本,Docker API的版本,git版本,Go语言运行时版本以及操作系统等版本信息。
/builtins/builtins.go
// builtins jobs independent of any subsystem
func dockerVersion(job *engine.Job) engine.Status {
v := &engine.Env{}
v.SetJson("Version", dockerversion.VERSION)
v.SetJson("ApiVersion", api.APIVERSION)
v.Set("GitCommit", dockerversion.GITCOMMIT)
v.Set("GoVersion", runtime.Version())
v.Set("Os", runtime.GOOS)
v.Set("Arch", runtime.GOARCH)
if kernelVersion, err := kernel.GetKernelVersion(); err == nil {
v.Set("KernelVersion", kernelVersion.String())
}
if _, err := v.WriteTo(job.Stdout); err != nil {
return job.Error(err)
}
return engine.StatusOK
}
3.5.5. 注册registry的Handler
registry.NewService().Install(eng)的实现过程位于./docker/registry/service.go,在eng对象对外暴露的API信息中添加docker registry的信息。当registry.NewService()成功被Install安装完毕的话,则有两个调用能够被eng使用:”auth”,向公有registry进行认证;”search”,在公有registry上搜索指定的镜像。
/registry/service.go
// NewService returns a new instance of Service ready to be
// installed no an engine.
func NewService() *Service {
return &Service{}
}
// Install installs registry capabilities to eng.
func (s *Service) Install(eng *engine.Engine) error {
eng.Register("auth", s.Auth)
eng.Register("search", s.Search)
return nil
}
3.6. 使用goroutine加载daemon对象
执行完builtins的加载,回到mainDaemon()的执行,通过一个goroutine来加载daemon对象并开始运行。这一环节的执行,主要包含三个步骤:
- 通过init函数中初始化的daemonCfg与eng对象来创建一个daemon对象d;(守护进程)
- 通过daemon对象的Install函数,向eng对象中注册众多的Handler;
- 在Docker Daemon启动完毕之后,运行名为”acceptconnections”的job,主要工作为向init守护进程发送”READY=1”信号,以便开始正常接受请求。
/docker/daemon.go
// load the daemon in the background so we can immediately start
// the http api so that connections don't fail while the daemon
// is booting
go func() {
d, err := daemon.NewDaemon(daemonCfg, eng)
if err != nil {
log.Fatal(err)
}
if err := d.Install(eng); err != nil {
log.Fatal(err)
}
// after the daemon is done setting up we can tell the api to start
// accepting connections
if err := eng.Job("acceptconnections").Run(); err != nil {
log.Fatal(err)
}
}()
3.6.1. 创建daemon对象
/docker/daemon.go
d, err := daemon.NewDaemon(daemonCfg, eng)
if err != nil {
log.Fatal(err)
}
daemon.NewDaemon(daemonCfg, eng)是创建daemon对象d的核心部分。主要作用为初始化Docker Daemon的基本环境,如处理config参数,验证系统支持度,配置Docker工作目录,设置与加载多种driver,创建graph环境等,验证DNS配置等。具体参考NewDaemon 。
3.6.2. 通过daemon对象为engine注册Handler
当创建完daemon对象,goroutine执行d.Install(eng)
/daemon/daemon.go
type Daemon struct {
repository string
sysInitPath string
containers *contStore
graph *graph.Graph
repositories *graph.TagStore
idIndex *truncindex.TruncIndex
sysInfo *sysinfo.SysInfo
volumes *graph.Graph
eng *engine.Engine
config *Config
containerGraph *graphdb.Database
driver graphdriver.Driver
execDriver execdriver.Driver
}
// Install installs daemon capabilities to eng.
func (daemon *Daemon) Install(eng *engine.Engine) error {
// FIXME: rename "delete" to "rm" for consistency with the CLI command
// FIXME: rename ContainerDestroy to ContainerRm for consistency with the CLI command
// FIXME: remove ImageDelete's dependency on Daemon, then move to graph/
for name, method := range map[string]engine.Handler{
"attach": daemon.ContainerAttach,
"build": daemon.CmdBuild,
"commit": daemon.ContainerCommit,
"container_changes": daemon.ContainerChanges,
"container_copy": daemon.ContainerCopy,
"container_inspect": daemon.ContainerInspect,
"containers": daemon.Containers,
"create": daemon.ContainerCreate,
"delete": daemon.ContainerDestroy,
"export": daemon.ContainerExport,
"info": daemon.CmdInfo,
"kill": daemon.ContainerKill,
"logs": daemon.ContainerLogs,
"pause": daemon.ContainerPause,
"resize": daemon.ContainerResize,
"restart": daemon.ContainerRestart,
"start": daemon.ContainerStart,
"stop": daemon.ContainerStop,
"top": daemon.ContainerTop,
"unpause": daemon.ContainerUnpause,
"wait": daemon.ContainerWait,
"image_delete": daemon.ImageDelete, // FIXME: see above
} {
if err := eng.Register(name, method); err != nil {
return err
}
}
if err := daemon.Repositories().Install(eng); err != nil {
return err
}
// FIXME: this hack is necessary for legacy integration tests to access
// the daemon object.
eng.Hack_SetGlobalVar("httpapi.daemon", daemon)
return nil
}
以上代码的实现分为三部分:
- 向eng对象中注册众多的Handler对象;
- daemon.Repositories().Install(eng)实现了向eng对象注册多个与image相关的Handler,Install的实现位于./docker/graph/service.go;
- eng.Hack_SetGlobalVar("httpapi.daemon", daemon)实现向eng对象中map类型的hack对象中添加一条记录,key为”httpapi.daemon”,value为daemon。
3.6.3. 运行acceptconnections的job
/docker/daemon.go
if err := eng.Job("acceptconnections").Run(); err != nil {
log.Fatal(err)
}
在goroutine内部最后运行名为”acceptconnections”的job,主要作用是通知init守护进程,Docker Daemon可以开始接受请求了。
首先执行eng.Job("acceptconnections")返回一个Job,随后再执行该Job的Run()函数,即eng.Job("acceptconnections").Run()。
/engine/engine.go
// Job creates a new job which can later be executed.
// This function mimics `Command` from the standard os/exec package.
func (eng *Engine) Job(name string, args ...string) *Job {
job := &Job{
Eng: eng,
Name: name,
Args: args,
Stdin: NewInput(),
Stdout: NewOutput(),
Stderr: NewOutput(),
env: &Env{},
}
if eng.Logging {
job.Stderr.Add(utils.NopWriteCloser(eng.Stderr))
}
// Catchall is shadowed by specific Register.
if handler, exists := eng.handlers[name]; exists {
job.handler = handler
} else if eng.catchall != nil && name != "" {
// empty job names are illegal, catchall or not.
job.handler = eng.catchall
}
return job
}
- 首先创建一个类型为Job的job对象,该对象中Eng属性为函数的调用者eng,Name属性为”acceptconnections”,没有参数传入。
- 另外在eng对象所有的handlers属性中寻找键为”acceptconnections”记录的值,由于在加载builtins操作中的remote(eng)中已经向eng注册过这样的一条记录,key为”acceptconnections”,value为apiserver.AcceptConnections。
- 因此job对象的handler为apiserver.AcceptConnections。
- 最后返回已经初始化完毕的对象job。
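为了更直观地理解上述Engine/Job/Handler的注册与分发机制,下面给出一段可独立运行的示意代码(非Docker源码,Engine、Job、Handler等类型均为演示用途的极简版本):

```go
package main

import "fmt"

// Handler 与 Docker 的 engine.Handler 类似:接收一个 Job 并返回状态码。
type Handler func(*Job) int

type Engine struct {
	handlers map[string]Handler
}

type Job struct {
	Name    string
	Args    []string
	handler Handler
}

func NewEngine() *Engine {
	return &Engine{handlers: make(map[string]Handler)}
}

// Register 把名字与处理方法绑定,对应 eng.Register("acceptconnections", ...)。
func (e *Engine) Register(name string, h Handler) {
	e.handlers[name] = h
}

// Job 按名字查找 handler 并构造 Job,对应 eng.Job("acceptconnections")。
func (e *Engine) Job(name string, args ...string) *Job {
	return &Job{Name: name, Args: args, handler: e.handlers[name]}
}

// Run 执行 handler,对应 Docker 中 job.Run() 里的 job.status = job.handler(job)。
func (j *Job) Run() error {
	if j.handler == nil {
		return fmt.Errorf("%s: command not found", j.Name)
	}
	if status := j.handler(j); status != 0 {
		return fmt.Errorf("%s exited with status %d", j.Name, status)
	}
	return nil
}

func main() {
	eng := NewEngine()
	eng.Register("acceptconnections", func(job *Job) int {
		fmt.Println("accepting connections...")
		return 0
	})
	if err := eng.Job("acceptconnections").Run(); err != nil {
		fmt.Println(err)
	}
}
```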
创建完job对象之后,随即执行该job对象的run()函数。
/engine/job.go
// A job is the fundamental unit of work in the docker engine.
// Everything docker can do should eventually be exposed as a job.
// For example: execute a process in a container, create a new container,
// download an archive from the internet, serve the http api, etc.
//
// The job API is designed after unix processes: a job has a name, arguments,
// environment variables, standard streams for input, output and error, and
// an exit status which can indicate success (0) or error (anything else).
//
// One slight variation is that jobs report their status as a string. The
// string "0" indicates success, and any other strings indicates an error.
// This allows for richer error reporting.
//
type Job struct {
Eng *Engine
Name string
Args []string
env *Env
Stdout *Output
Stderr *Output
Stdin *Input
handler Handler
status Status
end time.Time
}
type Status int
const (
StatusOK Status = 0
StatusErr Status = 1
StatusNotFound Status = 127
)
// Run executes the job and blocks until the job completes.
// If the job returns a failure status, an error is returned
// which includes the status.
func (job *Job) Run() error {
if job.Eng.IsShutdown() {
return fmt.Errorf("engine is shutdown")
}
// FIXME: this is a temporary workaround to avoid Engine.Shutdown
// waiting 5 seconds for server/api.ServeApi to complete (which it never will)
// everytime the daemon is cleanly restarted.
// The permanent fix is to implement Job.Stop and Job.OnStop so that
// ServeApi can cooperate and terminate cleanly.
if job.Name != "serveapi" {
job.Eng.l.Lock()
job.Eng.tasks.Add(1)
job.Eng.l.Unlock()
defer job.Eng.tasks.Done()
}
// FIXME: make this thread-safe
// FIXME: implement wait
if !job.end.IsZero() {
return fmt.Errorf("%s: job has already completed", job.Name)
}
// Log beginning and end of the job
job.Eng.Logf("+job %s", job.CallString())
defer func() {
job.Eng.Logf("-job %s%s", job.CallString(), job.StatusString())
}()
var errorMessage = bytes.NewBuffer(nil)
job.Stderr.Add(errorMessage)
if job.handler == nil {
job.Errorf("%s: command not found", job.Name)
job.status = 127
} else {
job.status = job.handler(job)
job.end = time.Now()
}
// Wait for all background tasks to complete
if err := job.Stdout.Close(); err != nil {
return err
}
if err := job.Stderr.Close(); err != nil {
return err
}
if err := job.Stdin.Close(); err != nil {
return err
}
if job.status != 0 {
return fmt.Errorf("%s", Tail(errorMessage, 1))
}
return nil
}
Run()函数的实现位于./docker/engine/job.go,该函数执行指定的job,并在job执行完成前一直阻塞。对于名为”acceptconnections”的job对象,运行代码为job.status = job.handler(job),由于job.handler值为apiserver.AcceptConnections,故真正执行的是job.status = apiserver.AcceptConnections(job)。
进入AcceptConnections的具体实现,位于./docker/api/server/server.go,如下:
/api/server/server.go
func AcceptConnections(job *engine.Job) engine.Status {
// Tell the init daemon we are accepting requests
go systemd.SdNotify("READY=1")
if activationLock != nil {
close(activationLock)
}
return engine.StatusOK
}
重点为go systemd.SdNotify("READY=1")的实现,位于./docker/pkg/system/sd_notify.go,主要作用是通知init守护进程Docker Daemon的启动已经全部完成,潜在的功能是使得Docker Daemon开始接受Docker Client发送来的API请求。
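SdNotify的大致原理可以用下面的示意代码说明(非Docker源码;假设init/systemd进程通过NOTIFY_SOCKET环境变量暴露一个unix datagram socket,sdNotify为演示用的函数名):

```go
package main

import (
	"errors"
	"net"
	"os"
)

// sdNotify 向 NOTIFY_SOCKET 指定的 unix datagram socket 写入状态字符串,
// 例如 "READY=1",以此通知 init 守护进程自身已经就绪。
func sdNotify(state string) error {
	socketPath := os.Getenv("NOTIFY_SOCKET") // 由 init/systemd 注入
	if socketPath == "" {
		return errors.New("NOTIFY_SOCKET not set")
	}
	conn, err := net.DialUnix("unixgram", nil,
		&net.UnixAddr{Name: socketPath, Net: "unixgram"})
	if err != nil {
		return err
	}
	defer conn.Close()
	_, err = conn.Write([]byte(state))
	return err
}

func main() {
	_ = sdNotify("READY=1")
}
```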
至此,已经完成通过goroutine来加载daemon对象并运行。
3.7. 打印Docker版本及驱动信息
显示docker的版本信息,以及ExecDriver和GraphDriver这两个驱动的具体信息
/docker/daemon.go
// TODO actually have a resolved graphdriver to show?
log.Printf("docker daemon: %s %s; execdriver: %s; graphdriver: %s",
dockerversion.VERSION,
dockerversion.GITCOMMIT,
daemonCfg.ExecDriver,
daemonCfg.GraphDriver,
)
3.8. serveapi的创建与运行
打印部分Docker具体信息之后,Docker Daemon立即创建并运行名为”serveapi”的job,主要作用为让Docker Daemon提供API访问服务。
/docker/daemon.go
// Serve api
job := eng.Job("serveapi", flHosts...)
job.SetenvBool("Logging", true)
job.SetenvBool("EnableCors", *flEnableCors)
job.Setenv("Version", dockerversion.VERSION)
job.Setenv("SocketGroup", *flSocketGroup)
job.SetenvBool("Tls", *flTls)
job.SetenvBool("TlsVerify", *flTlsVerify)
job.Setenv("TlsCa", *flCa)
job.Setenv("TlsCert", *flCert)
job.Setenv("TlsKey", *flKey)
job.SetenvBool("BufferRequests", true)
if err := job.Run(); err != nil {
log.Fatal(err)
}
- 创建一个名为"serveapi"的job,并将flHosts的值赋给job.Args。flHosts的作用主要是为Docker Daemon提供使用的协议与监听的地址。
- Docker Daemon为该job设置了众多的环境变量,如安全传输层协议的环境变量等。最后通过job.Run()运行该serveapi的job。
由于在eng中key为”serveapi”的handler,value为apiserver.ServeApi,故该job运行时,执行apiserver.ServeApi函数,位于./docker/api/server/server.go。ServeApi函数的作用主要是对于用户定义的所有支持协议,Docker Daemon均创建一个goroutine来启动相应的http.Server,分别为不同的协议服务。具体参考Docker Server。
参考:
- 《Docker源码分析》
12.3.4 - Docker Server的创建与运行
1. Docker Server创建流程
Docker Server是Docker Daemon的重要组成部分,功能是:接收Docker Client发送的请求,按照相应的路由规则实现请求的路由分发,最终将请求处理的结果返回给Docker Client。 Docker Daemon启动时,在mainDaemon()运行的最后创建并运行serveapi的Job,让Docker Daemon提供API访问服务。 Docker Server的整个生命周期包括:
- 创建Docker Server的Job
- 配置Job的环境变量
- 触发执行Job
说明:本文分析的代码为Docker 1.2.0版本。
1.1. 创建"serveapi"的Job
/docker/daemon.go
func mainDaemon() {
...
// Serve api
job := eng.Job("serveapi", flHosts...)
...
}
运行serveapi的Job时,会执行该Job的处理方法api.ServeApi。
1.2. 配置Job环境变量
/docker/daemon.go
job.SetenvBool("Logging", true)
job.SetenvBool("EnableCors", *flEnableCors)
job.Setenv("Version", dockerversion.VERSION)
job.Setenv("SocketGroup", *flSocketGroup)
job.SetenvBool("Tls", *flTls)
job.SetenvBool("TlsVerify", *flTlsVerify)
job.Setenv("TlsCa", *flCa)
job.Setenv("TlsCert", *flCert)
job.Setenv("TlsKey", *flKey)
job.SetenvBool("BufferRequests", true)
参数分为两种
- 创建Job实例时,用指定参数直接初始化Job的Args属性
- 创建Job后,给Job添加指定的环境变量
| 环境变量名 | FLAG参数 | 默认值 | 作用 |
|---|---|---|---|
| Logging | | true | 启用Docker容器的日志输出 |
| EnableCors | flEnableCors | false | 在远程API中提供CORS头 |
| Version | | | 显示Docker版本号 |
| SocketGroup | flSocketGroup | docker | 在daemon模式中为unix domain socket分配的用户组名 |
| Tls | flTls | false | 使用TLS安全传输协议 |
| TlsVerify | flTlsVerify | false | 使用TLS并验证远程客户端 |
| TlsCa | flCa | | 指定CA文件路径 |
| TlsCert | flCert | | TLS证书文件路径 |
| TlsKey | flKey | | TLS密钥文件路径 |
| BufferRequests | | true | 缓存Docker Client请求 |
1.3. 运行Job
/api/server/server.go
if err := job.Run(); err != nil {
log.Fatal(err)
}
Docker在eng对象中注册过键为serveapi的处理方法,运行Job时便执行这个键对应的处理方法,即api.ServeApi。
2. ServeApi运行流程
ServeApi属于Docker Server提供API服务的部分,作为一个监听请求、处理请求、响应请求的服务端,支持三种协议:TCP协议、UNIX Socket形式以及fd的形式。功能是:循环检查Docker Daemon支持的所有协议,并为每一种协议创建一个协程goroutine,并在协程内部配置一个服务于HTTP请求的服务端。
/api/server/server.go
// ServeApi loops through all of the protocols sent in to docker and spawns
// off a go routine to setup a serving http.Server for each.
func ServeApi(job *engine.Job) engine.Status {
if len(job.Args) == 0 {
return job.Errorf("usage: %s PROTO://ADDR [PROTO://ADDR ...]", job.Name)
}
var (
protoAddrs = job.Args
chErrors = make(chan error, len(protoAddrs))
)
activationLock = make(chan struct{})
for _, protoAddr := range protoAddrs {
protoAddrParts := strings.SplitN(protoAddr, "://", 2)
if len(protoAddrParts) != 2 {
return job.Errorf("usage: %s PROTO://ADDR [PROTO://ADDR ...]", job.Name)
}
go func() {
log.Infof("Listening for HTTP on %s (%s)", protoAddrParts[0], protoAddrParts[1])
chErrors <- ListenAndServe(protoAddrParts[0], protoAddrParts[1], job)
}()
}
for i := 0; i < len(protoAddrs); i += 1 {
err := <-chErrors
if err != nil {
return job.Error(err)
}
}
return engine.StatusOK
}
ServeApi执行流程:
- 检查Job参数,确保传入参数无误
- 定义Docker Server的监听协议与地址,以及错误信息管理channel
- 遍历协议地址,针对协议创建相应的服务端
- 通过chErrors建立goroutine与主进程之间的协调关系
2.1. 判断Job参数
判断Job参数job.Args(即数组flHosts):若flHosts的长度为0,则说明没有监听的协议与地址,参数有误。
/api/server/server.go
func ServeApi(job *engine.Job) engine.Status {
if len(job.Args) == 0 {
return job.Errorf("usage: %s PROTO://ADDR [PROTO://ADDR ...]", job.Name)
}
...
}
2.2. 定义监听协议与地址及错误信息
/api/server/server.go
var (
protoAddrs = job.Args
chErrors = make(chan error, len(protoAddrs))
)
activationLock = make(chan struct{})
定义protoAddrs[flHosts的内容]、chErrors[错误类型管道]与activationLock[用于同步serveapi和acceptconnections两个job执行的管道]三个变量。
2.3. 遍历协议地址
/api/server/server.go
for _, protoAddr := range protoAddrs {
protoAddrParts := strings.SplitN(protoAddr, "://", 2)
if len(protoAddrParts) != 2 {
return job.Errorf("usage: %s PROTO://ADDR [PROTO://ADDR ...]", job.Name)
}
go func() {
log.Infof("Listening for HTTP on %s (%s)", protoAddrParts[0], protoAddrParts[1])
chErrors <- ListenAndServe(protoAddrParts[0], protoAddrParts[1], job)
}()
}
遍历协议地址,针对每个协议地址创建一个goroutine作为相应的服务端。协议地址的格式为PROTO://ADDR,例如tcp://host:port。
2.4. 协调chErrors与主进程关系
根据chErrors的值决定流程:如果chErrors管道中收到错误,则ServeApi的本次循环结束并返回错误;若没有错误内容,循环会一直阻塞等待。即chErrors确保ListenAndServe所对应的goroutine能和主函数ServeApi进行协调,如果goroutine出错,主函数ServeApi仍然可以捕获这样的错误,从而使程序退出。
/api/server/server.go
for i := 0; i < len(protoAddrs); i += 1 {
err := <-chErrors
if err != nil {
return job.Error(err)
}
}
return engine.StatusOK
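这种"每个监听地址一个goroutine,通过带缓冲的error channel汇总错误"的写法,可以抽象成下面的示意代码(非Docker源码,serveOne、serveAll均为演示用的函数名):

```go
package main

import (
	"errors"
	"fmt"
)

// serveOne 模拟 ListenAndServe:在某个协议地址上提供服务,出错时返回 error。
func serveOne(protoAddr string) error {
	if protoAddr == "tcp://:2375" {
		return nil // 真实场景下会一直阻塞提供服务,这里简化为直接返回
	}
	return errors.New("listen failed on " + protoAddr)
}

func serveAll(protoAddrs []string) error {
	chErrors := make(chan error, len(protoAddrs)) // 带缓冲,防止 goroutine 因无人接收而泄漏
	for _, protoAddr := range protoAddrs {
		go func(addr string) {
			chErrors <- serveOne(addr)
		}(protoAddr)
	}
	// 主流程阻塞在这里,任何一个 goroutine 出错都能被捕获
	for range protoAddrs {
		if err := <-chErrors; err != nil {
			return err
		}
	}
	return nil
}

func main() {
	if err := serveAll([]string{"tcp://:2375", "unix:///var/run/docker.sock"}); err != nil {
		fmt.Println("serve error:", err)
	}
}
```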
3. ListenAndServe实现
ListenAndServe的功能:使Docker Server监听某一指定地址并接收该地址上的请求,再将请求路由转发至相应的处理方法。 ListenAndServe执行流程:
- 创建route路由实例
- 创建listener监听实例
- 创建http.Server
- 启动API服务
流程图:

3.1. 创建route路由实例
/api/server/server.go
// ListenAndServe sets up the required http.Server and gets it listening for
// each addr passed in and does protocol specific checking.
func ListenAndServe(proto, addr string, job *engine.Job) error {
var l net.Listener
r, err := createRouter(job.Eng, job.GetenvBool("Logging"), job.GetenvBool("EnableCors"), job.Getenv("Version"))
if err != nil {
return err
}
...
}
路由实例的作用:负责Docker Server对外部请求的路由及转发。 实现过程:
- 创建全新的route路由实例
- 为route实例添加路由记录
3.1.1. 创建空路由实例
/api/server/server.go
func createRouter(eng *engine.Engine, logging, enableCors bool, dockerVersion string) (*mux.Router, error) {
r := mux.NewRouter()
...
}
/vendor/src/github.com/gorilla/mux/mux.go
// NewRouter returns a new router instance.
func NewRouter() *Router {
return &Router{namedRoutes: make(map[string]*Route), KeepContext: false}
}
// This will send all incoming requests to the router.
type Router struct {
// Configurable Handler to be used when no route matches.
NotFoundHandler http.Handler
// Parent route, if this is a subrouter.
parent parentRoute
// Routes to be matched, in order.
routes []*Route
// Routes by name for URL building.
namedRoutes map[string]*Route
// See Router.StrictSlash(). This defines the flag for new routes.
strictSlash bool
// If true, do not clear the request context after handling the request
KeepContext bool
}
NewRouter()函数返回一个全新的route实例r,类型为mux.Router,并初始化namedRoutes和KeepContext两个属性。
- namedRoutes:map类型,key为string类型,value为Route路由记录类型
- KeepContext:默认为false,表示处理完请求后清除请求上下文,不对请求做存储操作
mux.Router会通过一系列已经注册过的路由记录,来匹配接收的请求。先通过请求的URL或者其他条件找到相应的路由记录,并调用这条记录中的执行处理方法。 mux.Router特性
- 请求可以基于URL的主机名、路径、路径前缀、schemes、请求头和请求参数、HTTP请求方法类型来匹配,也可以使用自定义的匹配规则
- URL主机名和路径可以通过一个正则表达式来表示
- 注册的URL可以被构建(反向生成),便于维护对资源的引用
- 路由记录同样可以用作子路由(subrouter)
3.1.2. 添加路由记录
/api/server/server.go
if os.Getenv("DEBUG") != "" {
AttachProfiler(r)
}
m := map[string]map[string]HttpApiFunc{
"GET": {
"/_ping": ping,
"/events": getEvents,
"/info": getInfo,
"/version": getVersion,
"/images/json": getImagesJSON,
"/images/viz": getImagesViz,
"/images/search": getImagesSearch,
"/images/{name:.*}/get": getImagesGet,
"/images/{name:.*}/history": getImagesHistory,
"/images/{name:.*}/json": getImagesByName,
"/containers/ps": getContainersJSON,
"/containers/json": getContainersJSON,
"/containers/{name:.*}/export": getContainersExport,
"/containers/{name:.*}/changes": getContainersChanges,
"/containers/{name:.*}/json": getContainersByName,
"/containers/{name:.*}/top": getContainersTop,
"/containers/{name:.*}/logs": getContainersLogs,
"/containers/{name:.*}/attach/ws": wsContainersAttach,
},
"POST": {
"/auth": postAuth,
"/commit": postCommit,
"/build": postBuild,
"/images/create": postImagesCreate,
"/images/load": postImagesLoad,
"/images/{name:.*}/push": postImagesPush,
"/images/{name:.*}/tag": postImagesTag,
"/containers/create": postContainersCreate,
"/containers/{name:.*}/kill": postContainersKill,
"/containers/{name:.*}/pause": postContainersPause,
"/containers/{name:.*}/unpause": postContainersUnpause,
"/containers/{name:.*}/restart": postContainersRestart,
"/containers/{name:.*}/start": postContainersStart,
"/containers/{name:.*}/stop": postContainersStop,
"/containers/{name:.*}/wait": postContainersWait,
"/containers/{name:.*}/resize": postContainersResize,
"/containers/{name:.*}/attach": postContainersAttach,
"/containers/{name:.*}/copy": postContainersCopy,
},
"DELETE": {
"/containers/{name:.*}": deleteContainers,
"/images/{name:.*}": deleteImages,
},
"OPTIONS": {
"": optionsHandler,
},
}
m的类型为映射,key表示HTTP的请求类型,如GET、POST、DELETE等,value为映射类型,代表URL与执行处理方法的映射。
/api/server/server.go
type HttpApiFunc func(eng *engine.Engine, version version.Version, w http.ResponseWriter, r *http.Request, vars map[string]string) error
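下面是一段示意代码(非Docker源码,apiFunc、getContainersByName等名称均为演示用的简化版本),展示如何把上述"方法→URL→处理函数"的映射注册到mux.Router,并把路径变量(如{name:.*})通过mux.Vars传给处理函数:

```go
package main

import (
	"fmt"
	"log"
	"net/http"

	"github.com/gorilla/mux"
)

// 简化版的 HttpApiFunc:省略了 engine 与 API 版本参数。
type apiFunc func(w http.ResponseWriter, r *http.Request, vars map[string]string) error

func getContainersByName(w http.ResponseWriter, r *http.Request, vars map[string]string) error {
	_, err := fmt.Fprintf(w, "inspect container %s\n", vars["name"])
	return err
}

func createRouter() *mux.Router {
	m := map[string]map[string]apiFunc{
		"GET": {
			"/containers/{name:.*}/json": getContainersByName,
		},
	}
	r := mux.NewRouter()
	for method, routes := range m {
		for path, f := range routes {
			handler := f // 避免闭包捕获循环变量
			r.HandleFunc(path, func(w http.ResponseWriter, req *http.Request) {
				if err := handler(w, req, mux.Vars(req)); err != nil {
					http.Error(w, err.Error(), http.StatusInternalServerError)
				}
			}).Methods(method)
		}
	}
	return r
}

func main() {
	log.Fatal(http.ListenAndServe("127.0.0.1:8080", createRouter()))
}
```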
3.2. 创建listener监听实例
路由模块完成请求的路由与分发,监听模块完成请求的监听功能。Listener是一种面向流协议的通用网络监听模块。
/api/server/server.go
var l net.Listener
...
if job.GetenvBool("BufferRequests") {
l, err = listenbuffer.NewListenBuffer(proto, addr, activationLock)
} else {
l, err = net.Listen(proto, addr)
}
Listenbuffer的作用:让Docker Server立即监听指定协议地址上的请求,但将这些请求暂时先缓存下来,等Docker Daemon全部启动完毕之后才让Docker Server开始接受这些请求。
/pkg/listenbuffer/buffer.go
// NewListenBuffer returns a listener listening on addr with the protocol.
func NewListenBuffer(proto, addr string, activate chan struct{}) (net.Listener, error) {
wrapped, err := net.Listen(proto, addr)
if err != nil {
return nil, err
}
return &defaultListener{
wrapped: wrapped,
activate: activate,
}, nil
}
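其核心思路可以用下面的示意代码表达(非Docker源码,bufferedListener为演示用的类型名,仅说明"先监听、等activate管道关闭后才开始Accept"的做法):

```go
package main

import (
	"fmt"
	"net"
)

// bufferedListener 包装一个真正的 net.Listener,
// 在 activate 管道关闭之前 Accept 一直阻塞,新到的连接先由内核 backlog 暂存。
type bufferedListener struct {
	wrapped  net.Listener
	activate chan struct{}
}

func (l *bufferedListener) Accept() (net.Conn, error) {
	<-l.activate // 等待类似 "acceptconnections" 的激活信号
	return l.wrapped.Accept()
}

func (l *bufferedListener) Close() error   { return l.wrapped.Close() }
func (l *bufferedListener) Addr() net.Addr { return l.wrapped.Addr() }

func main() {
	inner, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	activate := make(chan struct{})
	l := &bufferedListener{wrapped: inner, activate: activate}
	fmt.Println("listening on", l.Addr())
	close(activate) // 相当于 daemon 启动完成后发出的激活信号
	// 之后即可把 l 交给 http.Server{}.Serve(l) 使用
}
```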
若协议类型为TCP,且Job环境变量中Tls或TlsVerify有一个为true,则说明Docker Server需要支持HTTPS服务。此时需要建立一个tls.Config类型实例tlsConfig,在tlsConfig中加载证书、认证信息,再通过tls包中的NewListener函数创建支持HTTPS协议请求的Listener实例。
/api/server/server.go
l = tls.NewListener(l, tlsConfig)
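tlsConfig的构造大致如下面的示意代码所示(非Docker源码;buildTLSListener为演示用的函数名,证书文件路径均为假设值,分别对应Job环境变量TlsCert、TlsKey、TlsCa):

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"net"
	"os"
)

// buildTLSListener 在普通 listener 之上包一层 TLS;
// 当需要校验客户端证书(对应 TlsVerify)时,加载 CA 并要求双向认证。
func buildTLSListener(l net.Listener, certFile, keyFile, caFile string, verifyClient bool) (net.Listener, error) {
	cert, err := tls.LoadX509KeyPair(certFile, keyFile)
	if err != nil {
		return nil, err
	}
	tlsConfig := &tls.Config{Certificates: []tls.Certificate{cert}}
	if verifyClient {
		caPEM, err := os.ReadFile(caFile)
		if err != nil {
			return nil, err
		}
		pool := x509.NewCertPool()
		pool.AppendCertsFromPEM(caPEM)
		tlsConfig.ClientCAs = pool
		tlsConfig.ClientAuth = tls.RequireAndVerifyClientCert
	}
	return tls.NewListener(l, tlsConfig), nil
}

func main() {
	inner, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	// 证书路径仅为演示用的假设值
	if _, err := buildTLSListener(inner, "/etc/docker/server.pem",
		"/etc/docker/server-key.pem", "/etc/docker/ca.pem", true); err != nil {
		panic(err)
	}
}
```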
3.3. 创建http.Server
/api/server/server.go
httpSrv := http.Server{Addr: addr, Handler: r}
Docker Server需要创建一个Server对象来运行HTTP/HTTPS服务端,创建http.Server,addr为需要监听的地址,r为mux.Router。
3.4. 启动API服务
创建http.Server实例后,即启动API服务,监听请求,并对每一个请求生成一个新的协程来做专属服务。对于每个请求,协程会读取请求,查询路由表中的路由记录项,找到匹配的路由记录,最终调用路由记录中的处理方法,执行完毕返回响应信息。
/api/server/server.go
return httpSrv.Serve(l)
参考:
- 《Docker源码分析》
12.3.5 - Docker总架构
1. Docker的总架构图
docker是一个C/S模式的架构,后端是一个松耦合架构,模块各司其职。
- 用户是使用Docker Client与Docker Daemon建立通信,并发送请求给后者。
- Docker Daemon作为Docker架构中的主体部分,首先提供Server的功能使其可以接受Docker Client的请求;
- Engine执行Docker内部的一系列工作,每一项工作都是以一个Job的形式的存在。
- Job的运行过程中,当需要容器镜像时,则从Docker Registry中下载镜像,并通过镜像管理驱动graphdriver将下载镜像以Graph的形式存储;
- 当需要为Docker创建网络环境时,通过网络管理驱动networkdriver创建并配置Docker容器网络环境;
- 当需要限制Docker容器运行资源或执行用户指令等操作时,则通过execdriver来完成。
- libcontainer是一项独立的容器管理包,networkdriver以及execdriver都是通过libcontainer来实现具体对容器进行的操作。
2. Docker各模块组件分析
2.1. Docker Client[发起请求]
- Docker Client是和Docker Daemon建立通信的客户端。用户使用的可执行文件为docker(类似可执行脚本的命令),docker命令后接参数的形式来实现一个完整的请求命令(例如docker images,docker为命令不可变,images为参数可变)。
- Docker Client可以通过以下三种方式和Docker Daemon建立通信:tcp://host:port,unix://path_to_socket和fd://socketfd。
- Docker Client发送容器管理请求后,由Docker Daemon接受并处理请求,当Docker Client接收到返回的请求响应并进行简单处理后,Docker Client一次完整的生命周期就结束了。[一次完整的请求:发送请求→处理请求→返回结果],与传统的C/S架构请求流程并无不同。
2.2. Docker Daemon[后台守护进程]
Docker Daemon的架构图
2.2.1. Docker Server[调度分发请求]
Docker Server的架构图
- Docker Server相当于C/S架构的服务端。功能为接受并调度分发Docker Client发送的请求。接受请求后,Server通过路由与分发调度,找到相应的Handler来执行请求。
- 在Docker的启动过程中,通过包gorilla/mux,创建了一个mux.Router,提供请求的路由功能。在Golang中,gorilla/mux是一个强大的URL路由器以及调度分发器。该mux.Router中添加了众多的路由项,每一个路由项由HTTP请求方法(PUT、POST、GET或DELETE)、URL、Handler三部分组成。
- 创建完mux.Router之后,Docker将Server的监听地址以及mux.Router作为参数,创建一个httpSrv=http.Server{},最终执行httpSrv.Serve()为请求服务。
- 在Server的服务过程中,Server在listener上接受Docker Client的访问请求,并创建一个全新的goroutine来服务该请求。在goroutine中,首先读取请求内容,然后做解析工作,接着找到相应的路由项,随后调用相应的Handler来处理该请求,最后Handler处理完请求之后回复该请求。
2.2.2. Engine
- Engine是Docker架构中的运行引擎,同时也是Docker运行的核心模块。它扮演Docker container存储仓库的角色,并且通过执行job的方式来操纵管理这些容器。
- 在Engine数据结构的设计与实现过程中,有一个handler对象。该handler对象存储了众多特定job各自的处理方法。举例说明,Engine的handler对象中有一项为:{"create": daemon.ContainerCreate,},则说明当名为"create"的job在运行时,执行的是daemon.ContainerCreate这个handler。
2.2.3. Job
- 一个Job可以认为是Docker架构中Engine内部最基本的工作执行单元。Docker可以做的每一项工作,都可以抽象为一个job。例如:在容器内部运行一个进程,这是一个job;创建一个新的容器,这是一个job。Docker Server的运行过程也是一个job,名为serveapi。
- Job的设计者,把Job设计得与Unix进程相仿。比如说:Job有一个名称,有参数,有环境变量,有标准的输入输出,有错误处理,有返回状态等。
2.3. Docker Registry[镜像注册中心]
- Docker Registry是一个存储容器镜像的仓库(注册中心),可理解为云端镜像仓库,按repository来分类,docker pull 按照[repository]:[tag]来精确定义一个image。
- 在Docker的运行过程中,Docker Daemon会与Docker Registry通信,并实现搜索镜像、下载镜像、上传镜像三个功能,这三个功能对应的job名称分别为"search","pull" 与 "push"。
- 可分为公有仓库(docker hub)和私有仓库。
2.4. Graph[docker内部数据库]
Graph的架构图
2.4.1. Repository
- 已下载镜像的保管者(包括下载镜像和dockerfile构建的镜像)。
- 一个repository表示某类镜像的仓库(例如Ubuntu),同一个repository内的镜像用tag来区分(表示同一类镜像的不同标签或版本)。一个registry包含多个repository,一个repository包含同类型的多个image。
- 镜像的存储类型有aufs,devicemapper,Btrfs,Vfs等。其中centos系统使用devicemapper的存储类型。
- 同时在Graph的本地目录中,关于每一个容器镜像,具体存储的信息有:该容器镜像的元数据、容器镜像的大小信息,以及该容器镜像所代表的具体rootfs。
2.4.2. GraphDB
- 已下载容器镜像之间关系的记录者。
- GraphDB是一个构建在SQLite之上的小型图数据库,实现了节点的命名以及节点之间关联关系的记录
2.5. Driver[执行部分]
Driver是Docker架构中的驱动模块。通过Driver驱动,Docker可以实现对Docker容器执行环境的定制。即Graph负责镜像的存储,Driver负责容器的执行。
2.5.1. graphdriver
graphdriver架构图
- graphdriver主要用于完成容器镜像的管理,包括存储与获取。
- 存储:docker pull下载的镜像由graphdriver存储到本地的指定目录(Graph中)。
- 获取:docker run(create)用镜像来创建容器的时候由graphdriver到本地Graph中获取镜像。
2.5.2. networkdriver
networkdriver的架构图
- networkdriver的用途是完成Docker容器网络环境的配置,其中包括
- Docker启动时为Docker环境创建网桥;
- Docker容器创建时为其创建专属虚拟网卡设备;
- Docker容器分配IP、端口并与宿主机做端口映射,设置容器防火墙策略等。
2.5.3. execdriver
execdriver的架构图
- execdriver作为Docker容器的执行驱动,负责创建容器运行命名空间,负责容器资源使用的统计与限制,负责容器内部进程的真正运行等。
- 现在execdriver默认使用native驱动,不依赖于LXC。
2.6. libcontainer[函数库]
libcontainer的架构图
- libcontainer是Docker架构中一个使用Go语言设计实现的库,设计初衷是希望该库可以不依靠任何依赖,直接访问内核中与容器相关的API。
- Docker可以直接调用libcontainer,而最终操纵容器的namespace、cgroups、apparmor、网络设备以及防火墙规则等。
- libcontainer提供了一整套标准的接口来满足上层对容器管理的需求。或者说,libcontainer屏蔽了Docker上层对容器的直接管理。
2.7. docker container[服务交付的最终形式]
container架构
- Docker container(Docker容器)是Docker架构中服务交付的最终体现形式。
- Docker按照用户的需求与指令,订制相应的Docker容器:
- 用户通过指定容器镜像,使得Docker容器可以自定义rootfs等文件系统;
- 用户通过指定计算资源的配额,使得Docker容器使用指定的计算资源;
- 用户通过配置网络及其安全策略,使得Docker容器拥有独立且安全的网络环境;
- 用户通过指定运行的命令,使得Docker容器执行指定的工作。
参考文章:
- 《Docker源码分析》
12.3.6 - Docker镜像与容器的层次结构
1. 基本概念
1.1. image layer(镜像层)
镜像可以看成是由多个镜像层叠加起来的一个文件系统,镜像层也可以简单理解为一个基本的镜像,而每个镜像层之间通过指针的形式进行叠加。

根据上图,镜像层的主要组成部分包括:镜像层id、指向父层的镜像层指针,以及元数据【layer metadata】,元数据中包含了docker构建和运行的信息以及父层的层次信息。
只读层和读写层【top layer】的组成部分基本一致。同时读写层可以转换成只读层【docker commit操作实现】
1.2. image(镜像)---【只读层的集合】
1、镜像是一堆只读层的统一视角,除了最底层没有指向外,每一层都指向它的父层,统一文件系统(union file system)技术能够将不同的层整合成一个文件系统,为这些层提供了一个统一的视角,这样就隐藏了多层的存在,在用户的角度看来,只存在一个文件系统。而每一层都是不可写的,就是只读层。

1.3. container(容器)---【一层读写层+多层只读层】
1、容器和镜像的区别在于容器的最上面一层是读写层【top layer】,这一点与容器是否在运行无关。运行状态的容器【running container】即一个可读写的文件系统【静态容器】加上隔离的进程空间和其中的进程。

隔离的进程空间中的进程可以对该读写层进行增删改,其运行状态容器的进程操作都作用在该读写层上。每个容器只能有一个进程隔离空间。

2. Docker常用命令原理图概览:
3. Docker常用命令说明
3.1. 标识说明
3.1.1. image---(统一只读文件系统)

3.1.2. 静态容器【未运行的容器】---(统一可读写文件系统)

3.1.3. 动态容器【running container】---(进程空间(包括进程)+统一可读写文件系统)

3.2. 命令说明
3.2.1. docker生命周期相关命令:
3.2.1.1. docker create {image-id}

即为只读文件系统添加一层可读写层【top layer】,生成可读写文件系统,该命令状态下容器为静态容器,并没有运行。
3.2.1.2. docker start(restart) {container-id}
docker stop即为docker start的逆过程

即为可读写文件系统添加一个进程空间【包括进程】,生成动态容器【running container】
3.2.1.3. docker run {image-id}

docker run=docker create+docker start
类似流程如下 :

3.2.1.4. docker stop {container-id}

向运行的容器中发一个SIGTERM的信号,然后停止所有的进程。即为docker start的逆过程。
3.2.1.5. docker kill {container-id}

docker kill向容器发送不友好的SIGKILL的信号,相当于快速强制关闭容器,与docker stop的区别在于docker stop是正常关闭,先发SIGTERM信号,清理进程,再发SIGKILL信号退出。
3.2.1.6. docker pause {container-id}
docker unpause为逆过程---比较少使用

暂停容器中的所有进程,使用cgroup的freezer顺序暂停容器里的所有进程,docker unpause为逆过程即恢复所有进程。比较少使用。
3.2.1.7. docker commit {container-id}


把容器的可读写层转化成只读层,即从容器状态【可读写文件系统】变为镜像状态【只读文件系统】,可理解为【固化】。
3.2.1.8. docker build


docker build=docker run【运行容器】+【进程修改数据】+docker commit【固化数据】,不断循环直至生成所需镜像。
循环一次便会形成新的层(镜像)【原镜像层+已固化的可读写层】
docker build 一般作用在dockerfile文件上。
3.2.2. docker查询类命令
查询对象:①image,②container,③image/container中的数据,④系统信息[容器数,镜像数及其他]
3.2.2.1. Image
1、docker images

docker images 列出当前镜像【以顶层镜像id来表示整个完整镜像】,每个顶层镜像下面隐藏多个镜像层。
2、docker images -a

docker images -a列出所有镜像层【排序以每个顶层镜像id为首后接该镜像下的所有镜像层】,依次列出每个镜像的所有镜像层。
3、docker history {image-id}

docker history 列出该镜像id下的所有历史镜像。
3.2.2.2. Container
1、docker ps

列出所有运行的容器【running container】
2、docker ps -a

列出所有容器,包括静态容器【未运行的容器】和动态容器【running container】
3.2.2.3. Info
1、docker inspect {container-id} or {image-id}

提取出容器或镜像最顶层的元数据。
2、docker info
显示 Docker 系统信息,包括镜像和容器数。
3.2.3. docker操作类命令:
3.2.3.1. docker rm {container-id}

docker rm会移除容器(即移除其可读写层),该命令只能对静态容器【非运行状态】进行操作。
通过docker rm -f {container-id}的-f (force)参数可以强制删除运行状态的容器【running container】。
3.2.3.2. docker rmi {image-id}

docker rmi会移除镜像,即删除构成该镜像的只读层(前提是镜像未被任何容器引用)。
3.2.3.3. docker exec {running-container-id}

docker exec会在运行状态的容器中执行一个新的进程。
3.2.3.4. docker export {container-id}

docker export命令创建一个tar文件,并且移除了元数据和不必要的层,将多个层整合成了一个层,只保存了当前统一视角看到的内容。
参考文章:
12.3.7 - Dockerfile的使用
1. Dockerfile的说明
dockerfile指令忽略大小写,建议大写,#作为注释,每行只支持一条指令,指令可以带多个参数。
dockerfile指令分为构建指令和设置指令。
- 构建指令:用于构建image,其指定的操作不会在运行image的容器中执行。
- 设置指令:用于设置image的属性,其指定的操作会在运行image的容器中执行。
2. Dockerfile指令说明
2.1. FROM(指定基础镜像)[构建指令]
该命令用来指定基础镜像,在基础镜像的基础上修改数据从而构建新的镜像。基础镜像可以是本地仓库也可以是远程仓库。
指令有两种格式:
- FROM image【默认为latest版本】
- FROM image:tag【指定版本】
2.2. MAINTAINER(镜像创建者信息)[构建指令]
将镜像制作者(维护者)的信息写入image中,执行docker inspect时会输出该信息。
格式:MAINTAINER name
MAINTAINER命令已废弃,可使用maintainer label的方式。
LABEL maintainer="SvenDowideit@home.org.au"
2.3. RUN(安装软件用)[构建指令]
RUN可以运行任何被基础镜像支持的命令(即在基础镜像上执行一个进程),可以使用多条RUN指令,指令较长可以使用\来换行。
指令有两种格式:
- RUN command(the command is run in a shell - /bin/sh -c)
- RUN ["executable", "param1", "param2" ... ] (exec form)
- 指定使用其他终端实现,使用exec执行。
- 例子:RUN["/bin/bash","-c","echo hello"]
2.4. CMD(设置container启动时执行的操作)[设置指令]
用于容器启动时的指定操作,可以是自定义脚本或命令,只执行一次,多个默认执行最后一个。
指令有三种格式:
- CMD ["executable","param1","param2"] (like an exec, this is the preferred form)
- 运行一个可执行文件并提供参数。
- CMD command param1 param2 (as a shell)
- 直接执行shell命令,默认以/bin/sh -c执行。
- CMD ["param1","param2"] (as default parameters to ENTRYPOINT)
- 和ENTRYPOINT配合使用,只作为完整命令的参数部分。
2.5. ENTRYPOINT(设置container启动时执行的操作)[设置指令]
指定容器启动时执行的命令,若多次设置只执行最后一次。
ENTRYPOINT翻译为“进入点”,它的功能可以让容器表现得像一个可执行程序一样。
例子:ENTRYPOINT ["/bin/echo"],那么docker build出来的imageecho镜像,其容器的表现就像一个/bin/echo程序:执行docker run -it imageecho "this is a test",就会输出对应的字符串。
指令有两种格式:
- ENTRYPOINT ["executable", "param1", "param2"] (like an exec, the preferred form)
  - 和CMD配合使用,CMD则作为完整命令的参数部分,ENTRYPOINT以JSON格式指定执行的命令部分。CMD可以为ENTRYPOINT提供可变参数,不需要变动的参数可以写在ENTRYPOINT里面。
  - 例子:
    ENTRYPOINT ["/usr/bin/ls","-a"]
    CMD ["-l"]
- ENTRYPOINT command param1 param2 (as a shell)
- 独自使用,即和CMD类似,如果CMD也是个完整命令[CMD command param1 param2 (as a shell) ],那么会相互覆盖,只执行最后一个CMD或ENTRYPOINT。
- 例子:ENTRYPOINT ls -l
2.6. USER(设置container容器启动的登录用户)[设置指令]
设置启动容器的用户,默认为root用户。
格式:USER daemon
2.7. EXPOSE(指定容器需要映射到宿主机的端口)[设置指令]
该指令声明容器需要对外暴露的端口,配合运行容器时的-p参数可以将容器端口映射为宿主机端口[需确保宿主机端口未被占用]。通过宿主机IP和映射后的端口即可访问容器[避免每次运行容器时容器IP随机变化、不固定的问题]。EXPOSE可以设置多个端口号,相应地运行容器时多次使用-p参数。可以通过docker port <容器ID> <容器端口>来查看对应的宿主机映射端口。
格式:EXPOSE port [port...]
2.8. ENV(用于设置环境变量)[构建指令]
在image中设置环境变量[以键值对的形式],设置之后RUN命令可以使用该环境变量,在容器启动后也可以通过docker inspect查看环境变量或者通过 docker run --env key=value设置或修改环境变量。
格式:ENV key value
例子:ENV JAVA_HOME /path/to/java/dirent
2.9. ARG(用于设置变量)[构建指令]
ARG定义一个默认参数,可以在dockerfile中引用。构建阶段可以通过docker build --build-arg <arg_name>=<value>来覆盖默认值。
ARG <arg_name>[=<default value>]
# 可以搭配ENV使用
ENV env_name ${arg_name}
示例:
docker build --build-arg user=what_user .
2.10. ADD(从src复制文件到container的dest路径)[构建指令]
复制指定的src到容器中的dest,其中src是相对被构建的源目录的相对路径,可以是文件或目录的路径,也可以是一个远程的文件url。dest 是container中的绝对路径。所有拷贝到container中的文件和文件夹权限为0755,uid和gid为0。
- 如果src是一个目录,那么会将该目录下的所有文件添加到container中,不包括目录;
- 如果src文件是可识别的压缩格式,则docker会帮忙解压缩(注意压缩格式);
- 如果src是文件且dest不以斜杠结束,则会将dest视为文件,src的内容会写入dest;
- 如果src是文件且dest以斜杠结束,则会将src文件拷贝到dest目录下。
格式:ADD src dest
为避免 ADD命令带来的未知风险和复杂性,可以使用COPY命令替代ADD命令
2.11. COPY(复制文件)
复制本地主机的src为容器中的dest,目标路径不存在时会自动创建。
格式:COPY src dest
2.12. VOLUME(指定挂载点)[设置指令]
创建一个可以从本地主机或其他容器挂载的挂载点,使容器中的一个目录具有持久化存储数据的功能,该目录可以被容器本身使用也可以被其他容器使用。
格式:VOLUME ["mountpoint"]
其他容器使用共享数据卷:docker run -t -i --rm --volumes-from container1 image2 bash [container1为第一个容器的ID,image2为第二个容器运行image的名字。]
2.13. WORKDIR(切换目录)[设置指令]
相当于cd命令,可以多次切换目录,为RUN、CMD、ENTRYPOINT配置工作目录。可以使用多个WORKDIR命令,后续命令如果使用相对路径,则是在前一个WORKDIR路径的基础上执行[类似cd的功能]。
格式:WORKDIR /path/to/workdir
2.14. ONBUILD(在子镜像中执行)
当所创建的镜像作为其他新创建镜像的基础镜像时执行的操作命令,即在创建本镜像时不运行,当作为别人的基础镜像时再在构建时运行(可认为基础镜像为父镜像,而该命令即在它的子镜像构建时运行,相当于在子镜像构建时多加了一些命令)。
格式:ONBUILD <其他Dockerfile指令>
3. dockerfile示例
最佳实践
- 镜像可以分为三层:系统基础镜像、业务基础镜像、业务镜像。
- 尽量将不变的镜像操作放dockerfile前面。
- 一类RUN命令操作可以通过\和&&方式组合成一条RUN命令。
- dockerfile尽量清晰简洁。
文件目录
./
|-- Dockerfile
|-- docker-entrypoint.sh
|-- dumb-init
|-- conf # 配置文件路径
| `-- app_conf.py
|-- pkg # 安装包路径
| `-- install.tar.gz
|-- run.sh # 启动脚本
dockerfile示例
FROM centos:latest
LABEL maintainer="xxx@xxx.com"
ARG APP=appname
ENV APP ${APP}
# copy and install app
COPY conf/app_conf.py /usr/local/app/app_conf/app_conf.py
COPY pkg/${APP}-*-install.tar.gz /data/${APP}-install.tar.gz
RUN mkdir -p /data/${APP} \
&& tar -zxvf /data/${APP}-install.tar.gz -C /data/${APP} \
&& cd /data/${APP}/${APP}* \
&& ./install.sh
WORKDIR /usr/local/app/
# init
COPY dumb-init /usr/bin/dumb-init
COPY docker-entrypoint.sh /docker-entrypoint.sh
ENTRYPOINT ["/usr/bin/dumb-init", "--","/docker-entrypoint.sh"]
COPY run.sh /run.sh
RUN chmod +x /run.sh
CMD ["/run.sh"]
4. docker build
指定dockerfile文件构建
默认不指定dockerfile文件名,则读取指定路径的Dockerfile
docker build -t <image_name> -f <dockerfile_name> <dockerfile_path>
docker build --help
docker build --help
Usage: docker build [OPTIONS] PATH | URL | -
Build an image from a Dockerfile
Options:
--add-host list Add a custom host-to-IP mapping (host:ip)
--build-arg list Set build-time variables
--cache-from strings Images to consider as cache sources
--cgroup-parent string Optional parent cgroup for the container
--compress Compress the build context using gzip
--cpu-period int Limit the CPU CFS (Completely Fair Scheduler) period
--cpu-quota int Limit the CPU CFS (Completely Fair Scheduler) quota
-c, --cpu-shares int CPU shares (relative weight)
--cpuset-cpus string CPUs in which to allow execution (0-3, 0,1)
--cpuset-mems string MEMs in which to allow execution (0-3, 0,1)
--disable-content-trust Skip image verification (default true)
-f, --file string Name of the Dockerfile (Default is 'PATH/Dockerfile')
--force-rm Always remove intermediate containers
--iidfile string Write the image ID to the file
--isolation string Container isolation technology
--label list Set metadata for an image
-m, --memory bytes Memory limit
--memory-swap bytes Swap limit equal to memory plus swap: '-1' to enable unlimited swap
--network string Set the networking mode for the RUN instructions during build (default "default")
--no-cache Do not use cache when building the image
--pull Always attempt to pull a newer version of the image
-q, --quiet Suppress the build output and print image ID on success
--rm Remove intermediate containers after a successful build (default true)
--security-opt strings Security options
--shm-size bytes Size of /dev/shm
-t, --tag list Name and optionally a tag in the 'name:tag' format
--target string Set the target build stage to build.
--ulimit ulimit Ulimit options (default [])
参考:
12.3.8 - Docker的安装
1. CentOS 安装Docker
建议使用centos7
1.1. 安装Docker
1.1.1. 卸载旧版本
旧版本的Docker命名为docker或docker-engine,如果有安装旧版本,先卸载旧版本
$ sudo yum remove -y docker \
docker-client \
docker-client-latest \
docker-common \
docker-latest \
docker-latest-logrotate \
docker-logrotate \
docker-selinux \
docker-engine-selinux \
docker-engine
1.1.2. 使用仓库安装
1、安装yum-utils、device-mapper-persistent-data、lvm2
$ sudo yum install -y yum-utils \
device-mapper-persistent-data \
lvm2
2、添加软件源
$ sudo yum-config-manager \
--add-repo \
https://download.docker.com/linux/centos/docker-ce.repo
1.1.3. 安装Docker
安装最新版本的Docker CE。
$ sudo yum install -y docker-ce
1.1.4. 启动Docker
# 启动Docker
$ sudo systemctl start docker
# 运行容器
$ sudo docker run hello-world
1.2. 安装指定版本Docker
1、列出可安装版本
$ yum list docker-ce --showduplicates | sort -r
docker-ce.x86_64 18.03.0.ce-1.el7.centos docker-ce-stable
2、安装指定版本
例如:docker-ce-18.03.0.ce
$ sudo yum install docker-ce-<VERSION STRING>
1.3. 升级Docker
依据1.2的方法选择指定版本安装。
1.4. 卸载Docker
# 卸载Docker
$ sudo yum remove docker-ce
# 清理镜像、容器、存储卷等
$ sudo rm -rf /var/lib/docker
2. Ubuntu 安装Docker
2.1. 安装Docker
2.1.1. 卸载旧版本
旧版本的Docker命名为docker或docker-engine,如果有安装旧版本,先卸载旧版本
sudo apt-get remove docker docker-engine docker.io
2.1.2. 使用仓库安装
1、升级apt
sudo apt-get update
2、允许apt使用https
sudo apt-get install \
apt-transport-https \
ca-certificates \
curl \
software-properties-common
3、添加Docker 官方的GPG密钥
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
4、添加Docker软件源
sudo add-apt-repository \
"deb [arch=amd64] https://download.docker.com/linux/ubuntu \
$(lsb_release -cs) \
stable"
2.1.3. 安装Docker
# update
sudo apt-get update
# install docker
sudo apt-get install docker-ce
2.1.4. 启动Docker
# 设置为开机启动
sudo systemctl enable docker
# 启动docker
sudo systemctl start docker
2.2. 安装指定版本Docker
1、列出仓库的可安装版本,apt-cache madison docker-ce。
# apt-cache madison docker-ce
docker-ce | 18.06.0~ce~3-0~ubuntu | https://download.docker.com/linux/ubuntu bionic/stable amd64 Packages
docker-ce | 18.03.1~ce~3-0~ubuntu | https://download.docker.com/linux/ubuntu bionic/stable amd64 Packages
2、指定版本安装
例如:docker-ce=18.03.0~ce-0~ubuntu
sudo apt-get install docker-ce=<VERSION>
2.3. 升级Docker
# 更新源
sudo apt-get update
# 依据上述方法,指定版本安装
2.4. 卸载Docker
# 卸载 docker ce
sudo apt-get purge docker-ce
# 清理镜像、容器、存储卷等
sudo rm -rf /var/lib/docker
3. 离线rpm包安装Docker
3.1. 下载docker rpm包
rpm包地址:https://mirrors.aliyun.com/docker-ce/linux/centos/7/x86_64/stable/Packages/
下载指定版本的containerd.io、docker-ce、docker-ce-cli
wget https://mirrors.aliyun.com/docker-ce/linux/centos/7/x86_64/stable/Packages/containerd.io-1.2.6-3.3.el7.x86_64.rpm
wget https://mirrors.aliyun.com/docker-ce/linux/centos/7/x86_64/stable/Packages/docker-ce-18.09.9-3.el7.x86_64.rpm
wget https://mirrors.aliyun.com/docker-ce/linux/centos/7/x86_64/stable/Packages/docker-ce-cli-18.09.9-3.el7.x86_64.rpm
下载container-selinux
地址:http://mirror.centos.org/centos/7/extras/x86_64/Packages/
wget http://mirror.centos.org/centos/7/extras/x86_64/Packages/container-selinux-2.107-3.el7.noarch.rpm
3.2. 安装rpm包
# container-selinux
rpm -ivh container-selinux*.rpm
# containerd.io
rpm -ivh containerd.io*.rpm
# docker-ce
rpm -ivh docker-ce*.rpm
# docker-ce-cli
rpm -ivh docker-ce-cli*.rpm
3.3. 启动docker服务
# 启动
systemctl start docker
# 查看状态
systemctl status docker
文章参考:
12.4 - Kata Container
12.4.1 - kata容器简介
Kata-container简介
kata-container通过轻量型虚拟机技术构建一个安全的容器运行时,表现像容器一样,但通过硬件虚拟化技术提供强隔离,作为第二层的安全防护。
特点:
- 安全:独立的内核,提供网络、I/O、内存的隔离。
- 兼容性:支持OCI容器标准,k8s的CRI接口。
- 性能:兼容虚拟机的安全和容器的轻量特点。
- 简单:使用标准的接口。
1. kata-container架构
kata-container与传统container的比较
2. kata-runtime
Kata Containers runtime (kata-runtime)通过QEMU/KVM技术创建一种轻量型的虚拟机,兼容 OCI runtime specification 标准,支持Kubernetes Container Runtime Interface (CRI)接口,可以替代CRI shim runtime (runc),由k8s调用来创建pod或容器。
3. shim
shim类似Docker的 containerd-shim 或CRI-O的 conmon,主要用来监控和回收容器的进程,kata-shim需要处理所有的容器的IO流(stdout, stdin and stderr)和转发相关信号。
containerd-shim-kata-v2实现了Containerd Runtime V2 (Shim API),k8s可以通过containerd-shim-kata-v2(替代2N+1个shims[由一个containerd-shim和kata-shim组成])来创建pod。
4. kata-agent
在虚拟机内kata-agent作为一个daemon进程运行,并拉起容器的进程。kata-agent使用VIRTIO或VSOCK接口(QEMU在主机上暴露的socket文件)在guest虚拟机中运行gRPC服务器。kata-runtime通过gRPC协议与kata-agent通信,向kata-agent发送管理容器的命令。该协议还用于容器和管理引擎(例如Docker Engine)之间传送I/O流(stdout、stderr、stdin)。
容器内所有的执行命令和相关的IO流都需要通过QEMU在宿主机暴露的virtio-serial或vsock接口。在使用VIRTIO的情况下,每个虚拟机会创建一个Kata Containers proxy (kata-proxy)来处理命令和IO流。
kata-agent使用libcontainer 来管理容器的生命周期,复用了runc的部分代码。
5. kata-proxy
kata-proxy提供了 kata-shim 和 kata-runtime 与VM中的kata-agent通信的方式,其中通信方式是使用virtio-serial或vsock,默认是使用virtio-serial。
6. Hypervisor
kata-container通过QEMU/KVM来创建虚拟机给容器运行,可以支持多种hypervisors。
7. QEMU/KVM
待补充
参考文档:
12.4.2 - kata配置
1. 配置文件路径
默认的配置文件位于/usr/share/defaults/kata-containers/configuration.toml,如果/etc/kata-containers/configuration.toml的配置文件存在,则会替代默认的配置文件。
查看配置文件的路径命令如下:
# kata-runtime --kata-show-default-config-paths
/etc/kata-containers/configuration.toml
/usr/share/defaults/kata-containers/configuration.toml
指定自定义配置文件运行
kata-runtime --kata-config=/some/where/configuration.toml ...
2. kata-env
查看runtime使用到的环境参数,
kata-runtime kata-env
输出内容如下:
[Meta]
Version = "1.0.23"
[Runtime]
Debug = false
Trace = false
DisableGuestSeccomp = true
DisableNewNetNs = false
Path = "/usr/bin/kata-runtime"
[Runtime.Version]
Semver = "1.7.2"
Commit = "9b9282693cfbcf70d442916bea56771cc6fc3afe"
OCI = "1.0.1-dev"
[Runtime.Config]
Path = "/usr/share/defaults/kata-containers/configuration.toml"
[Hypervisor]
MachineType = "pc"
Version = "QEMU emulator version 2.11.0\nCopyright (c) 2003-2017 Fabrice Bellard and the QEMU Project developers"
Path = "/usr/bin/qemu-lite-system-x86_64"
BlockDeviceDriver = "virtio-scsi"
EntropySource = "/dev/urandom"
Msize9p = 8192
MemorySlots = 10
Debug = false
UseVSock = false
SharedFS = "virtio-9p"
[Image]
Path = "/usr/share/kata-containers/kata-containers-image_centos_1.7.2_agent_20190702.img"
[Kernel]
Path = "/usr/share/kata-containers/vmlinuz-4.19.28.42-6.1.container"
Parameters = "init=/usr/lib/systemd/systemd systemd.unit=kata-containers.target systemd.mask=systemd-networkd.service systemd.mask=systemd-networkd.socket systemd.mask=systemd-journald.service systemd.mask=systemd-journald.socket systemd.mask=systemd-journal-flush.service systemd.mask=systemd-journald-dev-log.socket systemd.mask=systemd-udevd.service systemd.mask=systemd-udevd.socket systemd.mask=systemd-udev-trigger.service systemd.mask=systemd-udevd-kernel.socket systemd.mask=systemd-udevd-control.socket systemd.mask=systemd-timesyncd.service systemd.mask=systemd-update-utmp.service systemd.mask=systemd-tmpfiles-setup.service systemd.mask=systemd-tmpfiles-cleanup.service systemd.mask=systemd-tmpfiles-cleanup.timer systemd.mask=tmp.mount systemd.mask=systemd-random-seed.service systemd.mask=systemd-coredump@.service"
[Initrd]
Path = ""
[Proxy]
Type = "kataProxy"
Version = "kata-proxy version 1.7.2-a56df7c"
Path = "/usr/libexec/kata-containers/kata-proxy"
Debug = false
[Shim]
Type = "kataShim"
Version = "kata-shim version 1.7.2-2ea178c"
Path = "/usr/libexec/kata-containers/kata-shim"
Debug = false
[Agent]
Type = "kata"
Debug = false
Trace = false
TraceMode = ""
TraceType = ""
[Host]
Kernel = "4.14.105-1-tlinux3-0008"
Architecture = "amd64"
VMContainerCapable = true
SupportVSocks = true
[Host.Distro]
Name = "Tencent tlinux"
Version = "2.2"
[Host.CPU]
Vendor = "GenuineIntel"
Model = "Intel(R) Xeon(R) CPU X3440 @ 2.53GHz"
[Netmon]
Version = "kata-netmon version 1.7.2"
Path = "/usr/libexec/kata-containers/kata-netmon"
Debug = false
Enable = false
3. configuration.toml
# Copyright (c) 2017-2019 Intel Corporation
#
# SPDX-License-Identifier: Apache-2.0
#
# XXX: WARNING: this file is auto-generated.
# XXX:
# XXX: Source file: "cli/config/configuration-qemu.toml.in"
# XXX: Project:
# XXX: Name: Kata Containers
# XXX: Type: kata
[hypervisor.qemu]
path = "/usr/bin/qemu-lite-system-x86_64"
kernel = "/usr/share/kata-containers/vmlinuz.container"
image = "/usr/share/kata-containers/kata-containers.img"
machine_type = "pc"
# Optional space-separated list of options to pass to the guest kernel.
# For example, use `kernel_params = "vsyscall=emulate"` if you are having
# trouble running pre-2.15 glibc.
#
# WARNING: - any parameter specified here will take priority over the default
# parameter value of the same name used to start the virtual machine.
# Do not set values here unless you understand the impact of doing so as you
# may stop the virtual machine from booting.
# To see the list of default parameters, enable hypervisor debug, create a
# container and look for 'default-kernel-parameters' log entries.
kernel_params = ""
# Path to the firmware.
# If you want that qemu uses the default firmware leave this option empty
firmware = ""
# Machine accelerators
# comma-separated list of machine accelerators to pass to the hypervisor.
# For example, `machine_accelerators = "nosmm,nosmbus,nosata,nopit,static-prt,nofw"`
machine_accelerators=""
# Default number of vCPUs per SB/VM:
# unspecified or 0 --> will be set to 1
# < 0 --> will be set to the actual number of physical cores
# > 0 <= number of physical cores --> will be set to the specified number
# > number of physical cores --> will be set to the actual number of physical cores
default_vcpus = 1
# Default maximum number of vCPUs per SB/VM:
# unspecified or == 0 --> will be set to the actual number of physical cores or to the maximum number
# of vCPUs supported by KVM if that number is exceeded
# > 0 <= number of physical cores --> will be set to the specified number
# > number of physical cores --> will be set to the actual number of physical cores or to the maximum number
# of vCPUs supported by KVM if that number is exceeded
# WARNING: Depending of the architecture, the maximum number of vCPUs supported by KVM is used when
# the actual number of physical cores is greater than it.
# WARNING: Be aware that this value impacts the virtual machine's memory footprint and CPU
# the hotplug functionality. For example, `default_maxvcpus = 240` specifies that until 240 vCPUs
# can be added to a SB/VM, but the memory footprint will be big. Another example, with
# `default_maxvcpus = 8` the memory footprint will be small, but 8 will be the maximum number of
# vCPUs supported by the SB/VM. In general, we recommend that you do not edit this variable,
# unless you know what are you doing.
default_maxvcpus = 0
# Bridges can be used to hot plug devices.
# Limitations:
# * Currently only pci bridges are supported
# * Until 30 devices per bridge can be hot plugged.
# * Until 5 PCI bridges can be cold plugged per VM.
# This limitation could be a bug in qemu or in the kernel
# Default number of bridges per SB/VM:
# unspecified or 0 --> will be set to 1
# > 1 <= 5 --> will be set to the specified number
# > 5 --> will be set to 5
default_bridges = 1
# Default memory size in MiB for SB/VM.
# If unspecified then it will be set 2048 MiB.
default_memory = 2048
#
# Default memory slots per SB/VM.
# If unspecified then it will be set 10.
# This is will determine the times that memory will be hotadded to sandbox/VM.
#memory_slots = 10
# The size in MiB will be plused to max memory of hypervisor.
# It is the memory address space for the NVDIMM devie.
# If set block storage driver (block_device_driver) to "nvdimm",
# should set memory_offset to the size of block device.
# Default 0
#memory_offset = 0
# Disable block device from being used for a container's rootfs.
# In case of a storage driver like devicemapper where a container's
# root file system is backed by a block device, the block device is passed
# directly to the hypervisor for performance reasons.
# This flag prevents the block device from being passed to the hypervisor,
# 9pfs is used instead to pass the rootfs.
disable_block_device_use = false
# Shared file system type:
# - virtio-9p (default)
# - virtio-fs
shared_fs = "virtio-9p"
# Path to vhost-user-fs daemon.
virtio_fs_daemon = "/usr/bin/virtiofsd-x86_64"
# Default size of DAX cache in MiB
virtio_fs_cache_size = 1024
# Cache mode:
#
# - none
# Metadata, data, and pathname lookup are not cached in guest. They are
# always fetched from host and any changes are immediately pushed to host.
#
# - auto
# Metadata and pathname lookup cache expires after a configured amount of
# time (default is 1 second). Data is cached while the file is open (close
# to open consistency).
#
# - always
# Metadata, data, and pathname lookup are cached in guest and never expire.
virtio_fs_cache = "always"
# Block storage driver to be used for the hypervisor in case the container
# rootfs is backed by a block device. This is virtio-scsi, virtio-blk
# or nvdimm.
block_device_driver = "virtio-scsi"
# Specifies cache-related options will be set to block devices or not.
# Default false
#block_device_cache_set = true
# Specifies cache-related options for block devices.
# Denotes whether use of O_DIRECT (bypass the host page cache) is enabled.
# Default false
#block_device_cache_direct = true
# Specifies cache-related options for block devices.
# Denotes whether flush requests for the device are ignored.
# Default false
#block_device_cache_noflush = true
# Enable iothreads (data-plane) to be used. This causes IO to be
# handled in a separate IO thread. This is currently only implemented
# for SCSI.
#
enable_iothreads = false
# Enable pre allocation of VM RAM, default false
# Enabling this will result in lower container density
# as all of the memory will be allocated and locked
# This is useful when you want to reserve all the memory
# upfront or in the cases where you want memory latencies
# to be very predictable
# Default false
#enable_mem_prealloc = true
# Enable huge pages for VM RAM, default false
# Enabling this will result in the VM memory
# being allocated using huge pages.
# This is useful when you want to use vhost-user network
# stacks within the container. This will automatically
# result in memory pre allocation
#enable_hugepages = true
# Enable swap of vm memory. Default false.
# The behaviour is undefined if mem_prealloc is also set to true
#enable_swap = true
# This option changes the default hypervisor and kernel parameters
# to enable debug output where available. This extra output is added
# to the proxy logs, but only when proxy debug is also enabled.
#
# Default false
#enable_debug = true
# Disable the customizations done in the runtime when it detects
# that it is running on top a VMM. This will result in the runtime
# behaving as it would when running on bare metal.
#
#disable_nesting_checks = true
# This is the msize used for 9p shares. It is the number of bytes
# used for 9p packet payload.
#msize_9p = 8192
# If true and vsocks are supported, use vsocks to communicate directly
# with the agent and no proxy is started, otherwise use unix
# sockets and start a proxy to communicate with the agent.
# Default false
#use_vsock = true
# VFIO devices are hotplugged on a bridge by default.
# Enable hotplugging on root bus. This may be required for devices with
# a large PCI bar, as this is a current limitation with hotplugging on
# a bridge. This value is valid for "pc" machine type.
# Default false
#hotplug_vfio_on_root_bus = true
# If host doesn't support vhost_net, set to true. Thus we won't create vhost fds for nics.
# Default false
#disable_vhost_net = true
#
# Default entropy source.
# The path to a host source of entropy (including a real hardware RNG)
# /dev/urandom and /dev/random are two main options.
# Be aware that /dev/random is a blocking source of entropy. If the host
# runs out of entropy, the VMs boot time will increase leading to get startup
# timeouts.
# The source of entropy /dev/urandom is non-blocking and provides a
# generally acceptable source of entropy. It should work well for pretty much
# all practical purposes.
#entropy_source= "/dev/urandom"
# Path to OCI hook binaries in the *guest rootfs*.
# This does not affect host-side hooks which must instead be added to
# the OCI spec passed to the runtime.
#
# You can create a rootfs with hooks by customizing the osbuilder scripts:
# https://github.com/kata-containers/osbuilder
#
# Hooks must be stored in a subdirectory of guest_hook_path according to their
# hook type, i.e. "guest_hook_path/{prestart,postart,poststop}".
# The agent will scan these directories for executable files and add them, in
# lexicographical order, to the lifecycle of the guest container.
# Hooks are executed in the runtime namespace of the guest. See the official documentation:
# https://github.com/opencontainers/runtime-spec/blob/v1.0.1/config.md#posix-platform-hooks
# Warnings will be logged if any error is encountered will scanning for hooks,
# but it will not abort container execution.
#guest_hook_path = "/usr/share/oci/hooks"
[factory]
# VM templating support. Once enabled, new VMs are created from template
# using vm cloning. They will share the same initial kernel, initramfs and
# agent memory by mapping it readonly. It helps speeding up new container
# creation and saves a lot of memory if there are many kata containers running
# on the same host.
#
# When disabled, new VMs are created from scratch.
#
# Note: Requires "initrd=" to be set ("image=" is not supported).
#
# Default false
#enable_template = true
# Specifies the path of template.
#
# Default "/run/vc/vm/template"
#template_path = "/run/vc/vm/template"
# The number of caches of VMCache:
# unspecified or == 0 --> VMCache is disabled
# > 0 --> will be set to the specified number
#
# VMCache is a function that creates VMs as caches before using it.
# It helps speed up new container creation.
# The function consists of a server and some clients communicating
# through Unix socket. The protocol is gRPC in protocols/cache/cache.proto.
# The VMCache server will create some VMs and cache them by factory cache.
# It will convert the VM to gRPC format and transport it when gets
# requestion from clients.
# Factory grpccache is the VMCache client. It will request gRPC format
# VM and convert it back to a VM. If VMCache function is enabled,
# kata-runtime will request VM from factory grpccache when it creates
# a new sandbox.
#
# Default 0
#vm_cache_number = 0
# Specify the address of the Unix socket that is used by VMCache.
#
# Default /var/run/kata-containers/cache.sock
#vm_cache_endpoint = "/var/run/kata-containers/cache.sock"
[proxy.kata]
path = "/usr/libexec/kata-containers/kata-proxy"
# If enabled, proxy messages will be sent to the system log
# (default: disabled)
#enable_debug = true
[shim.kata]
path = "/usr/libexec/kata-containers/kata-shim"
# If enabled, shim messages will be sent to the system log
# (default: disabled)
#enable_debug = true
# If enabled, the shim will create opentracing.io traces and spans.
# (See https://www.jaegertracing.io/docs/getting-started).
#
# Note: By default, the shim runs in a separate network namespace. Therefore,
# to allow it to send trace details to the Jaeger agent running on the host,
# it is necessary to set 'disable_new_netns=true' so that it runs in the host
# network namespace.
#
# (default: disabled)
#enable_tracing = true
[agent.kata]
# If enabled, make the agent display debug-level messages.
# (default: disabled)
#enable_debug = true
# Enable agent tracing.
#
# If enabled, the default trace mode is "dynamic" and the
# default trace type is "isolated". The trace mode and type are set
# explicity with the `trace_type=` and `trace_mode=` options.
#
# Notes:
#
# - Tracing is ONLY enabled when `enable_tracing` is set: explicitly
# setting `trace_mode=` and/or `trace_type=` without setting `enable_tracing`
# will NOT activate agent tracing.
#
# - See https://github.com/kata-containers/agent/blob/master/TRACING.md for
# full details.
#
# (default: disabled)
#enable_tracing = true
#
#trace_mode = "dynamic"
#trace_type = "isolated"
[netmon]
# If enabled, the network monitoring process gets started when the
# sandbox is created. This allows for the detection of some additional
# network being added to the existing network namespace, after the
# sandbox has been created.
# (default: disabled)
#enable_netmon = true
# Specify the path to the netmon binary.
path = "/usr/libexec/kata-containers/kata-netmon"
# If enabled, netmon messages will be sent to the system log
# (default: disabled)
#enable_debug = true
[runtime]
# If enabled, the runtime will log additional debug messages to the
# system log
# (default: disabled)
#enable_debug = true
#
# Internetworking model
# Determines how the VM should be connected to the
# the container network interface
# Options:
#
# - bridged
# Uses a linux bridge to interconnect the container interface to
# the VM. Works for most cases except macvlan and ipvlan.
#
# - macvtap
# Used when the Container network interface can be bridged using
# macvtap.
#
# - none
# Used when customize network. Only creates a tap device. No veth pair.
#
# - tcfilter
# Uses tc filter rules to redirect traffic from the network interface
# provided by plugin to a tap interface connected to the VM.
#
internetworking_model="tcfilter"
# disable guest seccomp
# Determines whether container seccomp profiles are passed to the virtual
# machine and applied by the kata agent. If set to true, seccomp is not applied
# within the guest
# (default: true)
disable_guest_seccomp=true
# If enabled, the runtime will create opentracing.io traces and spans.
# (See https://www.jaegertracing.io/docs/getting-started).
# (default: disabled)
#enable_tracing = true
# If enabled, the runtime will not create a network namespace for shim and hypervisor processes.
# This option may have some potential impacts to your host. It should only be used when you know what you're doing.
# `disable_new_netns` conflicts with `enable_netmon`
# `disable_new_netns` conflicts with `internetworking_model=bridged` and `internetworking_model=macvtap`. It works only
# with `internetworking_model=none`. The tap device will be in the host network namespace and can connect to a bridge
# (like OVS) directly.
# If you are using docker, `disable_new_netns` only works with `docker run --net=none`
# (default: false)
#disable_new_netns = true
# Enabled experimental feature list, format: ["a", "b"].
# Experimental features are features not stable enough for production,
# They may break compatibility, and are prepared for a big version bump.
# Supported experimental features:
# 1. "newstore": new persist storage driver which breaks backward compatibility,
# expected to move out of experimental in 2.0.0.
# (default: [])
experimental=[]
参考:
12.5 - GPU
12.5.1 - nvidia-device-plugin介绍
1. 简介
NVIDIA device plugin 通过k8s daemonset的方式部署到每个k8s的node节点上,实现了Kubernetes device plugin的接口。
提供以下功能:
- 暴露每个节点的GPU数量给集群
- 跟踪GPU的健康情况
- 使在k8s的节点可以运行GPU容器
2. 要求
- NVIDIA drivers ~= 384.81
- nvidia-docker version > 2.0 (see how to install and it's prerequisites)
- docker configured with nvidia as the default runtime.
- Kubernetes version >= 1.10
3. 使用
3.1. 安装NVIDIA drivers和nvidia-docker
提供GPU节点的机器,准备工作如下
- 安装NVIDIA drivers ~= 384.81
- 安装nvidia-docker version > 2.0
3.2. 配置docker runtime
配置nvidia runtime作为GPU节点的默认runtime。
修改文件/etc/docker/daemon.json,增加以下runtime内容。
{
"default-runtime": "nvidia",
"runtimes": {
"nvidia": {
"path": "/usr/bin/nvidia-container-runtime",
"runtimeArgs": []
}
}
}
3.3. 部署nvidia-device-plugin
$ kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/1.0.0-beta4/nvidia-device-plugin.yml
nvidia-device-plugin的daemonset yaml文件如下:
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: nvidia-device-plugin-daemonset
namespace: kube-system
spec:
selector:
matchLabels:
name: nvidia-device-plugin-ds
updateStrategy:
type: RollingUpdate
template:
metadata:
# This annotation is deprecated. Kept here for backward compatibility
# See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
annotations:
scheduler.alpha.kubernetes.io/critical-pod: ""
labels:
name: nvidia-device-plugin-ds
spec:
tolerations:
# This toleration is deprecated. Kept here for backward compatibility
# See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
- key: CriticalAddonsOnly
operator: Exists
- key: nvidia.com/gpu
operator: Exists
effect: NoSchedule
# Mark this pod as a critical add-on; when enabled, the critical add-on
# scheduler reserves resources for critical add-on pods so that they can
# be rescheduled after a failure.
# See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
priorityClassName: "system-node-critical"
containers:
- image: nvidia/k8s-device-plugin:1.0.0-beta4
name: nvidia-device-plugin-ctr
securityContext:
allowPrivilegeEscalation: false
capabilities:
drop: ["ALL"]
volumeMounts:
- name: device-plugin
mountPath: /var/lib/kubelet/device-plugins
volumes:
- name: device-plugin
hostPath:
path: /var/lib/kubelet/device-plugins
3.4. 运行GPU任务
创建一个GPU的pod,pod的资源类型指定为nvidia.com/gpu。
apiVersion: v1
kind: Pod
metadata:
name: gpu-pod
spec:
containers:
- name: cuda-container
image: nvidia/cuda:9.0-devel
resources:
limits:
nvidia.com/gpu: 2 # requesting 2 GPUs
- name: digits-container
image: nvidia/digits:6.0
resources:
limits:
nvidia.com/gpu: 2 # requesting 2 GPUs
4. 构建和运行nvidia-device-plugin
4.1. docker方式
4.1.1. 编译
- 直接拉取dockerhub的镜像
$ docker pull nvidia/k8s-device-plugin:1.0.0-beta4
- 拉取代码构建镜像
$ docker build -t nvidia/k8s-device-plugin:1.0.0-beta4 https://github.com/NVIDIA/k8s-device-plugin.git#1.0.0-beta4
- 修改nvidia-device-plugin后构建镜像
$ git clone https://github.com/NVIDIA/k8s-device-plugin.git && cd k8s-device-plugin
$ git checkout 1.0.0-beta4
$ docker build -t nvidia/k8s-device-plugin:1.0.0-beta4 .
4.1.2. 运行
- docker本地运行
$ docker run --security-opt=no-new-privileges --cap-drop=ALL --network=none -it -v /var/lib/kubelet/device-plugins:/var/lib/kubelet/device-plugins nvidia/k8s-device-plugin:1.0.0-beta4
- daemonset运行
$ kubectl create -f nvidia-device-plugin.yml
4.2. 非docker方式
4.2.1. 编译
$ C_INCLUDE_PATH=/usr/local/cuda/include LIBRARY_PATH=/usr/local/cuda/lib64 go build
4.2.2. 本地运行
$ ./k8s-device-plugin
参考:
13 - Etcd集群
13.1 - Etcd介绍
1. Etcd是什么(what)
etcd is a distributed, consistent key-value store for shared configuration and service discovery, with a focus on being:
- Secure: automatic TLS with optional client cert authentication[可选的SSL客户端证书认证:支持https访问 ]
- Fast: benchmarked 10,000 writes/sec[单实例每秒10000次写操作]
- Reliable: properly distributed using Raft[使用Raft保证一致性]
etcd是一个分布式、一致性的键值存储系统,主要用于配置共享和服务发现。[以上内容来自etcd官网]
2. 为什么使用Etcd(why)
2.1. Etcd的优势
- 简单。使用Go语言编写部署简单;使用HTTP作为接口使用简单;使用Raft算法保证强一致性让用户易于理解。
- 数据持久化。etcd默认数据一更新就进行持久化。
- 安全。etcd支持SSL客户端安全认证。
3. 如何实现Etcd架构(how)
3.1. Etcd的相关名词解释
- Raft:etcd所采用的保证分布式系统强一致性的算法。
- Node:一个Raft状态机实例。
- Member: 一个etcd实例。它管理着一个Node,并且可以为客户端请求提供服务。
- Cluster:由多个Member构成可以协同工作的etcd集群。
- Peer:对同一个etcd集群中另外一个Member的称呼。
- Client: 向etcd集群发送HTTP请求的客户端。
- WAL:预写式日志,etcd用于持久化存储的日志格式。
- snapshot:etcd防止WAL文件过多而设置的快照,存储etcd数据状态。
- Proxy:etcd的一种模式,为etcd集群提供反向代理服务。
- Leader:Raft算法中通过竞选而产生的处理所有数据提交的节点。
- Follower:竞选失败的节点作为Raft中的从属节点,为算法提供强一致性保证。
- Candidate:当Follower超过一定时间接收不到Leader的心跳时转变为Candidate开始竞选。【候选人】
- Term:某个节点成为Leader到下一次竞选时间,称为一个Term。【任期】
- Index:数据项编号。Raft中通过Term和Index来定位数据。
3.2. Etcd的架构图

一个用户的请求发送过来,会经由HTTP Server转发给Store进行具体的事务处理,如果涉及到节点的修改,则交给Raft模块进行状态的变更、日志的记录,然后再同步给别的etcd节点以确认数据提交,最后进行数据的提交,再次同步。
1、HTTP Server:
用于处理用户发送的API请求以及其它etcd节点的同步与心跳信息请求。
2、Raft:
Raft强一致性算法的具体实现,是etcd的核心。
3、WAL:
Write Ahead Log（预写式日志），是etcd的数据存储方式，是一类为系统提供原子性和持久性保证的技术。除了在内存中存有所有数据的状态以及节点的索引以外，etcd通过WAL进行持久化存储：所有的数据在提交前都会先记录日志。
- Entry[日志内容]：负责存储具体日志的内容。
- Snapshot[快照内容]：Snapshot是为了防止数据过多而进行的状态快照，日志内容发生变化时保存Raft的状态。
4、Store:
用于处理etcd支持的各类功能的事务,包括数据索引、节点状态变更、监控与反馈、事件处理与执行等等,是etcd对用户提供的大多数API功能的具体实现。
13.2 - Raft算法
1. Raft协议[分布式一致性算法]

raft算法中涉及三种角色,分别是:
- follower：跟随者
- candidate：候选者，选举过程中的中间状态角色
- leader：领导者
2. 过程
2.1. 选举
有两个timeout来控制选举：一个是election timeout，即节点从follower转为candidate前等待心跳的时间，取150到300毫秒之间的随机值；另一个是heartbeat timeout，即leader发送心跳的间隔。
- 当某个节点经历完election timeout成为candidate后，开启新的一个选举周期，它向其他节点发起投票请求(Request Vote)；如果接收到消息的节点在该周期内还没投过票，则给这个candidate投票，并重置自己的election timeout。
- 当该candidate获得大多数的选票，则当选为leader。
- leader开始向其他follower节点发送append entries消息，该消息在heartbeat timeout指定的间隔内发出，follower收到后响应给leader。
- 这个选举周期会一直持续，直到某个follower收不到心跳并转为candidate。
- 如果某个选举周期内，有两个candidate获得相同多的选票，则会等待一个新的周期重新选举。
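下面用一小段Go代码示意随机选举超时的基本逻辑（仅为帮助理解的示意代码，并非etcd/Raft的真实实现）：
```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// electionTimeout 返回150ms~300ms之间的随机选举超时时间，
// 随机化可以降低多个follower同时变为candidate、瓜分选票的概率。
func electionTimeout() time.Duration {
	return time.Duration(150+rand.Intn(151)) * time.Millisecond
}

func main() {
	heartbeat := make(chan struct{}) // 模拟收到leader心跳(append entries)的通道

	for {
		select {
		case <-heartbeat:
			// 超时前收到心跳：保持follower状态，进入下一轮循环并重新取随机超时
		case <-time.After(electionTimeout()):
			// 超时未收到心跳：转为candidate，开启新任期(term)并发起投票请求(Request Vote)
			fmt.Println("election timeout: become candidate and request votes")
			return
		}
	}
}
```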
2.2. 同步
当选举过程结束、选出了leader后，leader需要把所有的变更同步到系统中的其他节点，该同步也是通过发送Append Entries消息的方式完成。
- 首先一个客户端发送一个更新给leader,这个更新会添加到leader的日志中。
- 然后leader会在给follower的下次心跳探测中发送该更新。
- 一旦大多数follower收到这个更新并返回给leader,leader提交这个更新,然后返回给客户端。
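这里的“大多数”即法定人数(quorum)，可以用一小段Go代码示意该判断（仅为示意）：
```go
package main

import "fmt"

// quorum 返回集群的法定人数(多数派)，即 n/2 + 1。
func quorum(clusterSize int) int { return clusterSize/2 + 1 }

// canCommit 判断一条日志在收到acks个节点(含leader自身)的确认后能否提交。
func canCommit(acks, clusterSize int) bool { return acks >= quorum(clusterSize) }

func main() {
	fmt.Println(canCommit(3, 5)) // true：5节点集群收到3个确认即可提交
	fmt.Println(canCommit(2, 5)) // false：不足多数派，更新不会被提交(网络分区中少数派一侧即是这种情况)
}
```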
2.3. 网络分区
- 当发生网络分区的时候,在不同分区的节点接收不到leader的心跳,则会开启一轮选举,形成不同leader的多个分区集群。
- 当客户端向不同分区的leader发送更新消息时，若该分区集群中的节点个数不超过原集群节点数的一半，更新不会被提交；而节点个数超过原集群节点数一半的分区，更新会被提交。
- 当网络分区恢复后,被提交的更新会同步到其他的节点上,其他节点未提交的日志会被回滚并匹配新leader的日志,保证全局的数据是一致的。
参考:
13.3 - etcdctl命令工具
13.3.1 - etcdctl-V3
etcdctl的v3版本与v2版本使用命令有所不同,本文介绍etcdctl v3版本的命令工具的使用方式。
1. etcdctl的安装
etcdctl的二进制文件可以在 github.com/coreos/etcd/releases 选择对应的版本下载,例如可以执行以下install_etcdctl.sh的脚本,修改其中的版本信息。
#!/bin/bash
ETCD_VER=v3.3.4
ETCD_DIR=etcd-download
DOWNLOAD_URL=https://github.com/coreos/etcd/releases/download
# Download
mkdir ${ETCD_DIR}
cd ${ETCD_DIR}
wget ${DOWNLOAD_URL}/${ETCD_VER}/etcd-${ETCD_VER}-linux-amd64.tar.gz
tar -xzvf etcd-${ETCD_VER}-linux-amd64.tar.gz
# install
cd etcd-${ETCD_VER}-linux-amd64
cp etcdctl /usr/local/bin/
2. etcdctl V3
使用etcdctl v3版本时，需设置环境变量ETCDCTL_API=3。
export ETCDCTL_API=3
# 或者在`/etc/profile`文件中添加环境变量
vi /etc/profile
...
export ETCDCTL_API=3
...
source /etc/profile
# 或者在命令执行前加 ETCDCTL_API=3
ETCDCTL_API=3 etcdctl --endpoints=$ENDPOINTS member list
查看当前etcdctl的版本信息etcdctl version。
[root@k8s-dbg-master-1 etcd]# etcdctl version
etcdctl version: 3.3.4
API version: 3.3
更多命令帮助可以执行etcdctl --help查询。
[root@k8s-dbg-master-1 etcd]# etcdctl --help
NAME:
etcdctl - A simple command line client for etcd3.
USAGE:
etcdctl
VERSION:
3.3.4
API VERSION:
3.3
COMMANDS:
get Gets the key or a range of keys
put Puts the given key into the store
del Removes the specified key or range of keys [key, range_end)
txn Txn processes all the requests in one transaction
compaction Compacts the event history in etcd
alarm disarm Disarms all alarms
alarm list Lists all alarms
defrag Defragments the storage of the etcd members with given endpoints
endpoint health Checks the healthiness of endpoints specified in `--endpoints` flag
endpoint status Prints out the status of endpoints specified in `--endpoints` flag
endpoint hashkv Prints the KV history hash for each endpoint in --endpoints
move-leader Transfers leadership to another etcd cluster member.
watch Watches events stream on keys or prefixes
version Prints the version of etcdctl
lease grant Creates leases
lease revoke Revokes leases
lease timetolive Get lease information
lease list List all active leases
lease keep-alive Keeps leases alive (renew)
member add Adds a member into the cluster
member remove Removes a member from the cluster
member update Updates a member in the cluster
member list Lists all members in the cluster
snapshot save Stores an etcd node backend snapshot to a given file
snapshot restore Restores an etcd member snapshot to an etcd directory
snapshot status Gets backend snapshot status of a given file
make-mirror Makes a mirror at the destination etcd cluster
migrate Migrates keys in a v2 store to a mvcc store
lock Acquires a named lock
elect Observes and participates in leader election
auth enable Enables authentication
auth disable Disables authentication
user add Adds a new user
user delete Deletes a user
user get Gets detailed information of a user
user list Lists all users
user passwd Changes password of user
user grant-role Grants a role to a user
user revoke-role Revokes a role from a user
role add Adds a new role
role delete Deletes a role
role get Gets detailed information of a role
role list Lists all roles
role grant-permission Grants a key to a role
role revoke-permission Revokes a key from a role
check perf Check the performance of the etcd cluster
help Help about any command
OPTIONS:
--cacert="" verify certificates of TLS-enabled secure servers using this CA bundle
--cert="" identify secure client using this TLS certificate file
--command-timeout=5s timeout for short running command (excluding dial timeout)
--debug[=false] enable client-side debug logging
--dial-timeout=2s dial timeout for client connections
-d, --discovery-srv="" domain name to query for SRV records describing cluster endpoints
--endpoints=[127.0.0.1:2379] gRPC endpoints
--hex[=false] print byte strings as hex encoded strings
--insecure-discovery[=true] accept insecure SRV records describing cluster endpoints
--insecure-skip-tls-verify[=false] skip server certificate verification
--insecure-transport[=true] disable transport security for client connections
--keepalive-time=2s keepalive time for client connections
--keepalive-timeout=6s keepalive timeout for client connections
--key="" identify secure client using this TLS key file
--user="" username[:password] for authentication (prompt if password is not supplied)
-w, --write-out="simple" set the output format (fields, json, protobuf, simple, table)
3. etcdctl 常用命令
3.1. 指定etcd集群
HOST_1=10.240.0.17
HOST_2=10.240.0.18
HOST_3=10.240.0.19
ENDPOINTS=$HOST_1:2379,$HOST_2:2379,$HOST_3:2379
etcdctl --endpoints=$ENDPOINTS member list
如果etcd设置了证书访问,则需要添加证书相关参数:
ETCDCTL_API=3 etcdctl --endpoints=$ENDPOINTS --cacert=<ca-file> --cert=<cert-file> --key=<key-file> <command>
参数说明如下:
--cacert="" verify certificates of TLS-enabled secure servers using this CA bundle
--cert="" identify secure client using this TLS certificate file
--key="" identify secure client using this TLS key file
--endpoints=[127.0.0.1:2379] gRPC endpoints
可以自定义alias命令
# alias 命令,避免每次需要输入证书参数
alias ectl='ETCDCTL_API=3 etcdctl --endpoints=$ENDPOINTS --cacert=<ca-file> --cert=<cert-file> --key=<key-file>'
# 直接使用别名执行命令
ectl <command>
3.2. 增删改查
1、增
etcdctl --endpoints=$ENDPOINTS put foo "Hello World!"
2、查
etcdctl --endpoints=$ENDPOINTS get foo
etcdctl --endpoints=$ENDPOINTS --write-out="json" get foo
基于相同前缀查找
etcdctl --endpoints=$ENDPOINTS put web1 value1
etcdctl --endpoints=$ENDPOINTS put web2 value2
etcdctl --endpoints=$ENDPOINTS put web3 value3
etcdctl --endpoints=$ENDPOINTS get web --prefix
列出所有的key
etcdctl --endpoints=$ENDPOINTS get / --prefix --keys-only
3、删
etcdctl --endpoints=$ENDPOINTS put key myvalue
etcdctl --endpoints=$ENDPOINTS del key
etcdctl --endpoints=$ENDPOINTS put k1 value1
etcdctl --endpoints=$ENDPOINTS put k2 value2
etcdctl --endpoints=$ENDPOINTS del k --prefix
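上面的增删改查除了用etcdctl，也可以直接使用etcd的Go客户端(clientv3)完成。下面是一个最小示例（示意代码；import路径随etcd版本不同，可能是github.com/coreos/etcd/clientv3或go.etcd.io/etcd/clientv3，端点沿用上文的示例地址）：
```go
package main

import (
	"context"
	"fmt"
	"time"

	"go.etcd.io/etcd/clientv3" // 旧版本etcd对应 github.com/coreos/etcd/clientv3
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"10.240.0.17:2379", "10.240.0.18:2379", "10.240.0.19:2379"},
		DialTimeout: 2 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// 增/改：等价于 etcdctl put foo "Hello World!"
	if _, err := cli.Put(ctx, "foo", "Hello World!"); err != nil {
		panic(err)
	}

	// 查：等价于 etcdctl get foo
	resp, err := cli.Get(ctx, "foo")
	if err != nil {
		panic(err)
	}
	for _, kv := range resp.Kvs {
		fmt.Printf("%s = %s\n", kv.Key, kv.Value)
	}

	// 按前缀查：等价于 etcdctl get web --prefix
	webResp, err := cli.Get(ctx, "web", clientv3.WithPrefix())
	if err != nil {
		panic(err)
	}
	fmt.Println("keys with prefix web:", len(webResp.Kvs))

	// 删：等价于 etcdctl del k --prefix
	if _, err := cli.Delete(ctx, "k", clientv3.WithPrefix()); err != nil {
		panic(err)
	}
}
```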
3.3. 集群状态
集群状态主要是etcdctl endpoint status 和etcdctl endpoint health两条命令。
etcdctl --write-out=table --endpoints=$ENDPOINTS endpoint status
+------------------+------------------+---------+---------+-----------+-----------+------------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+------------------+------------------+---------+---------+-----------+-----------+------------+
| 10.240.0.17:2379 | 4917a7ab173fabe7 | 3.0.0 | 45 kB | true | 4 | 16726 |
| 10.240.0.18:2379 | 59796ba9cd1bcd72 | 3.0.0 | 45 kB | false | 4 | 16726 |
| 10.240.0.19:2379 | 94df724b66343e6c | 3.0.0 | 45 kB | false | 4 | 16726 |
+------------------+------------------+---------+---------+-----------+-----------+------------+
etcdctl --endpoints=$ENDPOINTS endpoint health
10.240.0.17:2379 is healthy: successfully committed proposal: took = 3.345431ms
10.240.0.19:2379 is healthy: successfully committed proposal: took = 3.767967ms
10.240.0.18:2379 is healthy: successfully committed proposal: took = 4.025451ms
3.4. 集群成员
跟集群成员相关的命令如下:
member add Adds a member into the cluster
member remove Removes a member from the cluster
member update Updates a member in the cluster
member list Lists all members in the cluster
例如 etcdctl member list列出集群成员的命令。
etcdctl --endpoints=http://172.16.5.4:12379 member list -w table
+-----------------+---------+-------+------------------------+-----------------------------------------------+
| ID | STATUS | NAME | PEER ADDRS | CLIENT ADDRS |
+-----------------+---------+-------+------------------------+-----------------------------------------------+
| c856d92a82ba66a | started | etcd0 | http://172.16.5.4:2380 | http://172.16.5.4:2379,http://172.16.5.4:4001 |
+-----------------+---------+-------+------------------------+-----------------------------------------------+
4. etcdctl get
使用etcdctl {command} --help可以查看具体命令的帮助信息。
# etcdctl get --help
NAME:
get - Gets the key or a range of keys
USAGE:
etcdctl get [options] <key> [range_end]
OPTIONS:
--consistency="l" Linearizable(l) or Serializable(s)
--from-key[=false] Get keys that are greater than or equal to the given key using byte compare
--keys-only[=false] Get only the keys
--limit=0 Maximum number of results
--order="" Order of results; ASCEND or DESCEND (ASCEND by default)
--prefix[=false] Get keys with matching prefix
--print-value-only[=false] Only write values when using the "simple" output format
--rev=0 Specify the kv revision
--sort-by="" Sort target; CREATE, KEY, MODIFY, VALUE, or VERSION
GLOBAL OPTIONS:
--cacert="" verify certificates of TLS-enabled secure servers using this CA bundle
--cert="" identify secure client using this TLS certificate file
--command-timeout=5s timeout for short running command (excluding dial timeout)
--debug[=false] enable client-side debug logging
--dial-timeout=2s dial timeout for client connections
--endpoints=[127.0.0.1:2379] gRPC endpoints
--hex[=false] print byte strings as hex encoded strings
--insecure-skip-tls-verify[=false] skip server certificate verification
--insecure-transport[=true] disable transport security for client connections
--key="" identify secure client using this TLS key file
--user="" username[:password] for authentication (prompt if password is not supplied)
-w, --write-out="simple" set the output format (fields, json, protobuf, simple, table)
文章参考:
13.3.2 - etcdctl-V2
1. etcdctl介绍
etcdctl是一个命令行客户端，它提供了一些简洁的命令，可理解为命令工具集，方便我们对etcd服务进行测试或者手动修改数据库内容。etcdctl与其他xxxctl命令的原理及操作类似（例如kubectl、systemctl）。
用法：etcdctl [global options] command [command options] [args...]
2. Etcd常用命令
2.1. 数据库操作命令
etcd 在键的组织上采用了层次化的空间结构(类似于文件系统中目录的概念),数据库操作围绕对键值和目录的 CRUD [增删改查](符合 REST 风格的一套操作:Create, Read, Update, Delete)完整生命周期的管理。
具体的命令选项参数可以通过 etcdctl command --help来获取相关帮助。
2.1.1. 对象为键值
- set[增:无论是否存在]：etcdctl set key value
- mk[增:必须不存在]：etcdctl mk key value
- rm[删]：etcdctl rm key
- update[改]：etcdctl update key value
- get[查]：etcdctl get key
2.1.2. 对象为目录
- setdir[增:无论是否存在]：etcdctl setdir dir
- mkdir[增:必须不存在]：etcdctl mkdir dir
- rmdir[删]：etcdctl rmdir dir
- updatedir[改]：etcdctl updatedir dir
- ls[查]：etcdctl ls
2.2. 非数据库操作命令
- backup[备份etcd的数据]：etcdctl backup
- watch[监测一个键值的变化，一旦键值发生更新，就会输出最新的值并退出]：etcdctl watch key
- exec-watch[监测一个键值的变化，一旦键值发生更新，就执行给定命令]：etcdctl exec-watch key -- sh -c "ls"
- member[通过list、add、remove、update命令列出、添加、删除、更新etcd集群中的实例]：etcdctl member list、etcdctl member add 实例、etcdctl member remove 实例、etcdctl member update 实例
- cluster-health[检查集群健康状态]：etcdctl cluster-health
2.3. 常用配置参数
设置配置文件,默认为/etc/etcd/etcd.conf。
| 配置参数 | 参数说明 |
|---|---|
| -name | 节点名称 |
| -data-dir | 保存日志和快照的目录,默认为当前工作目录,指定节点的数据存储目录 |
| -addr | 公布的ip地址和端口。 默认为127.0.0.1:2379 |
| -bind-addr | 用于客户端连接的监听地址,默认为-addr配置 |
| -peers | 集群成员逗号分隔的列表,例如 127.0.0.1:2380,127.0.0.1:2381 |
| -peer-addr | 集群服务通讯的公布的IP地址,默认为 127.0.0.1:2380. |
| -peer-bind-addr | 集群服务通讯的监听地址,默认为-peer-addr配置 |
| -wal-dir | 指定节点的wal文件的存储目录，若指定了该参数，wal文件会和其他数据文件分开存储 |
| -listen-client-urls | 监听URL，用于客户端连接 |
| -listen-peer-urls | 监听URL,用于与其他节点通讯 |
| -initial-advertise-peer-urls | 告知集群其他节点url. |
| -advertise-client-urls | 告知客户端url, 也就是服务的url |
| -initial-cluster-token | 集群的ID |
| -initial-cluster | 集群中所有节点 |
| -initial-cluster-state | -initial-cluster-state=new 表示从无到有搭建etcd集群 |
| -discovery-srv | 用于DNS动态服务发现,指定DNS SRV域名 |
| -discovery | 用于etcd动态发现,指定etcd发现服务的URL [https://discovery.etcd.io/],用环境变量表示 |
13.4 - Etcd访问控制
1. ETCD资源类型
There are three types of resources in etcd:
- permission resources: users and roles in the user store
- key-value resources: key-value pairs in the key-value store
- settings resources: security settings, auth settings, and dynamic etcd cluster settings (election/heartbeat)
2. 权限资源
Users:user用来设置身份认证(user:passwd),一个用户可以拥有多个角色,每个角色被分配一定的权限(只读、只写、可读写),用户分为root用户和非root用户。
Roles：角色用来关联权限。角色主要分为三类：root角色，默认在创建root用户时创建，拥有所有权限；guest角色，默认自动创建，主要用于非认证访问；普通角色，由root用户创建，并被分配指定的权限。
注意：如果没有指定任何验证方式，即没有显式指定以什么用户进行访问，那么默认会被设定为guest角色。默认情况下guest也具有全局访问权限，如果不希望未授权就能获取或修改etcd的数据，则可收回guest角色的权限或删除该角色（参考etcdctl role revoke命令）。
Permissions:权限分为只读、只写、可读写三种权限,权限即对指定目录或key的读写权限。
3. ETCD访问控制
3.1. 访问控制相关命令
NAME:
etcdctl - A simple command line client for etcd.
USAGE:
etcdctl [global options] command [command options] [arguments...]
VERSION:
2.2.0
COMMANDS:
user user add, grant and revoke subcommands
role role add, grant and revoke subcommands
auth overall auth controls
GLOBAL OPTIONS:
--peers, -C a comma-delimited list of machine addresses in the cluster (default: "http://127.0.0.1:4001,http://127.0.0.1:2379")
--endpoint a comma-delimited list of machine addresses in the cluster (default: "http://127.0.0.1:4001,http://127.0.0.1:2379")
--cert-file identify HTTPS client using this SSL certificate file
--key-file identify HTTPS client using this SSL key file
--ca-file verify certificates of HTTPS-enabled servers using this CA bundle
--username, -u provide username[:password] and prompt if password is not supplied.
--timeout '1s' connection timeout per request
3.2. user相关命令
[root@localhost etcd]# etcdctl user --help
NAME:
etcdctl user - user add, grant and revoke subcommands
USAGE:
etcdctl user command [command options] [arguments...]
COMMANDS:
add add a new user for the etcd cluster
get get details for a user
list list all current users
remove remove a user for the etcd cluster
grant grant roles to an etcd user
revoke revoke roles for an etcd user
passwd change password for a user
help, h Shows a list of commands or help for one command
OPTIONS:
--help, -h show help
3.2.1. 添加root用户并设置密码
etcdctl --endpoints http://172.16.22.36:2379 user add root
3.2.2. 添加非root用户并设置密码
etcdctl --endpoints http://172.16.22.36:2379 --username root:123 user add huwh
3.2.3. 查看当前所有用户
etcdctl --endpoints http://172.16.22.36:2379 --username root:123 user list
3.2.4. 将用户添加到对应角色
etcdctl --endpoints http://172.16.22.36:2379 --username root:123 user grant --roles test1 phpor
3.2.5. 查看用户拥有哪些角色
etcdctl --endpoints http://172.16.22.36:2379 --username root:123 user get phpor
3.3. role相关命令
[root@localhost etcd]# etcdctl role --help
NAME:
etcdctl role - role add, grant and revoke subcommands
USAGE:
etcdctl role command [command options] [arguments...]
COMMANDS:
add add a new role for the etcd cluster
get get details for a role
list list all roles
remove remove a role from the etcd cluster
grant grant path matches to an etcd role
revoke revoke path matches for an etcd role
help, h Shows a list of commands or help for one command
OPTIONS:
--help, -h show help
3.3.1. 添加角色
etcdctl --endpoints http://172.16.22.36:2379 --username root:123 role add test1
3.3.2. 查看所有角色
etcdctl --endpoints http://172.16.22.36:2379 --username root:123 role list
3.3.3. 给角色分配权限
[root@localhost etcd]# etcdctl role grant --help
NAME:
grant - grant path matches to an etcd role
USAGE:
command grant [command options] [arguments...]
OPTIONS:
--path Path granted for the role to access
--read Grant read-only access
--write Grant write-only access
--readwrite Grant read-write access
1、只包含目录 etcdctl --endpoints http://172.16.22.36:2379 --username root:123 role grant --readwrite --path /test1 test1
2、包括目录和子目录或文件 etcdctl --endpoints http://172.16.22.36:2379 --username root:123 role grant --readwrite --path /test1/* test1
3.3.4. 查看角色所拥有的权限
etcdctl --endpoints http://172.16.22.36:2379 --username root:123 role get test1
3.4. auth相关操作
[root@localhost etcd]# etcdctl auth --help
NAME:
etcdctl auth - overall auth controls
USAGE:
etcdctl auth command [command options] [arguments...]
COMMANDS:
enable enable auth access controls
disable disable auth access controls
help, h Shows a list of commands or help for one command
OPTIONS:
--help, -h show help
3.4.1. 开启认证
etcdctl --endpoints http://172.16.22.36:2379 auth enable
4. 访问控制设置步骤
| 顺序 | 步骤 | 命令 |
|---|---|---|
| 1 | 添加root用户 | etcdctl --endpoints http://<ip>:2379 user add root |
| 2 | 开启认证 | etcdctl --endpoints http://<ip>:2379 auth enable |
| 3 | 添加非root用户 | etcdctl --endpoints http://<ip>:2379 --username root:<passwd> user add <user> |
| 4 | 添加角色 | etcdctl --endpoints http://<ip>:2379 --username root:<passwd> role add <role> |
| 5 | 给角色授权(只读、只写、可读写) | etcdctl --endpoints http://<ip>:2379 --username root:<passwd> role grant --readwrite --path <path> <role> |
| 6 | 给用户分配角色(即分配了角色对应的权限) | etcdctl --endpoints http://<ip>:2379 --username root:<passwd> user grant --roles <role> <user> |
5. 访问认证的API调用
更多参考
13.5 - Etcd启动配置参数
1. Etcd配置参数
/ # etcd --help
usage: etcd [flags]
start an etcd server
etcd --version
show the version of etcd
etcd -h | --help
show the help information about etcd
etcd --config-file
path to the server configuration file
etcd gateway
run the stateless pass-through etcd TCP connection forwarding proxy
etcd grpc-proxy
run the stateless etcd v3 gRPC L7 reverse proxy
1.1. member flags
member flags:
--name 'default'
human-readable name for this member.
--data-dir '${name}.etcd'
path to the data directory.
--wal-dir ''
path to the dedicated wal directory.
--snapshot-count '100000'
number of committed transactions to trigger a snapshot to disk.
--heartbeat-interval '100'
time (in milliseconds) of a heartbeat interval.
--election-timeout '1000'
time (in milliseconds) for an election to timeout. See tuning documentation for details.
--initial-election-tick-advance 'true'
whether to fast-forward initial election ticks on boot for faster election.
--listen-peer-urls 'http://localhost:2380'
list of URLs to listen on for peer traffic.
--listen-client-urls 'http://localhost:2379'
list of URLs to listen on for client traffic.
--max-snapshots '5'
maximum number of snapshot files to retain (0 is unlimited).
--max-wals '5'
maximum number of wal files to retain (0 is unlimited).
--cors ''
comma-separated whitelist of origins for CORS (cross-origin resource sharing).
--quota-backend-bytes '0'
raise alarms when backend size exceeds the given quota (0 defaults to low space quota).
--max-txn-ops '128'
maximum number of operations permitted in a transaction.
--max-request-bytes '1572864'
maximum client request size in bytes the server will accept.
--grpc-keepalive-min-time '5s'
minimum duration interval that a client should wait before pinging server.
--grpc-keepalive-interval '2h'
frequency duration of server-to-client ping to check if a connection is alive (0 to disable).
--grpc-keepalive-timeout '20s'
additional duration of wait before closing a non-responsive connection (0 to disable).
1.2. clustering flags
clustering flags:
--initial-advertise-peer-urls 'http://localhost:2380'
list of this member's peer URLs to advertise to the rest of the cluster.
--initial-cluster 'default=http://localhost:2380'
initial cluster configuration for bootstrapping.
--initial-cluster-state 'new'
initial cluster state ('new' or 'existing').
--initial-cluster-token 'etcd-cluster'
initial cluster token for the etcd cluster during bootstrap.
Specifying this can protect you from unintended cross-cluster interaction when running multiple clusters.
--advertise-client-urls 'http://localhost:2379'
list of this member's client URLs to advertise to the public.
The client URLs advertised should be accessible to machines that talk to etcd cluster. etcd client libraries parse these URLs to connect to the cluster.
--discovery ''
discovery URL used to bootstrap the cluster.
--discovery-fallback 'proxy'
expected behavior ('exit' or 'proxy') when discovery services fails.
"proxy" supports v2 API only.
--discovery-proxy ''
HTTP proxy to use for traffic to discovery service.
--discovery-srv ''
dns srv domain used to bootstrap the cluster.
--strict-reconfig-check 'true'
reject reconfiguration requests that would cause quorum loss.
--auto-compaction-retention '0'
auto compaction retention length. 0 means disable auto compaction.
--auto-compaction-mode 'periodic'
interpret 'auto-compaction-retention' one of: periodic|revision. 'periodic' for duration based retention, defaulting to hours if no time unit is provided (e.g. '5m'). 'revision' for revision number based retention.
--enable-v2 'true'
Accept etcd V2 client requests.
1.3. proxy flags
proxy flags:
"proxy" supports v2 API only.
--proxy 'off'
proxy mode setting ('off', 'readonly' or 'on').
--proxy-failure-wait 5000
time (in milliseconds) an endpoint will be held in a failed state.
--proxy-refresh-interval 30000
time (in milliseconds) of the endpoints refresh interval.
--proxy-dial-timeout 1000
time (in milliseconds) for a dial to timeout.
--proxy-write-timeout 5000
time (in milliseconds) for a write to timeout.
--proxy-read-timeout 0
time (in milliseconds) for a read to timeout.
1.4. security flags
security flags:
--ca-file '' [DEPRECATED]
path to the client server TLS CA file. '-ca-file ca.crt' could be replaced by '-trusted-ca-file ca.crt -client-cert-auth' and etcd will perform the same.
--cert-file ''
path to the client server TLS cert file.
--key-file ''
path to the client server TLS key file.
--client-cert-auth 'false'
enable client cert authentication.
--client-crl-file ''
path to the client certificate revocation list file.
--trusted-ca-file ''
path to the client server TLS trusted CA cert file.
--auto-tls 'false'
client TLS using generated certificates.
--peer-ca-file '' [DEPRECATED]
path to the peer server TLS CA file. '-peer-ca-file ca.crt' could be replaced by '-peer-trusted-ca-file ca.crt -peer-client-cert-auth' and etcd will perform the same.
--peer-cert-file ''
path to the peer server TLS cert file.
--peer-key-file ''
path to the peer server TLS key file.
--peer-client-cert-auth 'false'
enable peer client cert authentication.
--peer-trusted-ca-file ''
path to the peer server TLS trusted CA file.
--peer-auto-tls 'false'
peer TLS using self-generated certificates if --peer-key-file and --peer-cert-file are not provided.
--peer-crl-file ''
path to the peer certificate revocation list file.
1.5. logging flags
logging flags
--debug 'false'
enable debug-level logging for etcd.
--log-package-levels ''
specify a particular log level for each etcd package (eg: 'etcdmain=CRITICAL,etcdserver=DEBUG').
--log-output 'default'
specify 'stdout' or 'stderr' to skip journald logging even when running under systemd.
1.6. unsafe flags
unsafe flags:
Please be CAUTIOUS when using unsafe flags because it will break the guarantees
given by the consensus protocol.
--force-new-cluster 'false'
force to create a new one-member cluster.
1.7. profiling flags
profiling flags:
--enable-pprof 'false'
Enable runtime profiling data via HTTP server. Address is at client URL + "/debug/pprof/"
--metrics 'basic'
Set level of detail for exported metrics, specify 'extensive' to include histogram metrics.
--listen-metrics-urls ''
List of URLs to listen on for metrics.
1.8. auth flags
auth flags:
--auth-token 'simple'
Specify a v3 authentication token type and its options ('simple' or 'jwt').
1.9. experimental flags
experimental flags:
--experimental-initial-corrupt-check 'false'
enable to check data corruption before serving any client/peer traffic.
--experimental-corrupt-check-time '0s'
duration of time between cluster corruption check passes.
--experimental-enable-v2v3 ''
serve v2 requests through the v3 backend under a given prefix.
13.6 - Etcd中的k8s数据
1. 读取数据key
使用以下命令列出所有的key。
ETCDCTL_API=3 etcdctl --endpoints=<etcd-ip-1>:2379,<etcd-ip-2>:2379,<etcd-ip-3>:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --key=/etc/kubernetes/pki/apiserver-etcd-client.key --cert=/etc/kubernetes/pki/apiserver-etcd-client.crt get / --prefix --keys-only
参数说明:
--cacert="" verify certificates of TLS-enabled secure servers using this CA bundle
--cert="" identify secure client using this TLS certificate file
--key="" identify secure client using this TLS key file
--endpoints=[127.0.0.1:2379] gRPC endpoints
可以使用alias来简化这一串带证书参数的etcdctl命令。
alias ectl='ETCDCTL_API=3 etcdctl --endpoints=<etcd-ip-1>:2379,<etcd-ip-2>:2379,<etcd-ip-3>:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --key=/etc/kubernetes/pki/apiserver-etcd-client.key --cert=/etc/kubernetes/pki/apiserver-etcd-client.crt'
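与上面的etcdctl命令等价，也可以用Go客户端携带证书读取这些key。下面是一个示意代码（import路径随etcd版本不同可能为go.etcd.io/etcd/...或github.com/coreos/etcd/...，证书路径沿用上文kubeadm的默认路径）：
```go
package main

import (
	"context"
	"fmt"
	"time"

	"go.etcd.io/etcd/clientv3"
	"go.etcd.io/etcd/pkg/transport"
)

func main() {
	// 使用与etcdctl相同的客户端证书构造TLS配置
	tlsInfo := transport.TLSInfo{
		CertFile:      "/etc/kubernetes/pki/apiserver-etcd-client.crt",
		KeyFile:       "/etc/kubernetes/pki/apiserver-etcd-client.key",
		TrustedCAFile: "/etc/kubernetes/pki/etcd/ca.crt",
	}
	tlsConfig, err := tlsInfo.ClientConfig()
	if err != nil {
		panic(err)
	}

	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"<etcd-ip-1>:2379"}, // 替换为实际的etcd地址
		DialTimeout: 2 * time.Second,
		TLS:         tlsConfig,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// 等价于 etcdctl get /registry --prefix --keys-only
	resp, err := cli.Get(ctx, "/registry", clientv3.WithPrefix(), clientv3.WithKeysOnly())
	if err != nil {
		panic(err)
	}
	for _, kv := range resp.Kvs {
		fmt.Println(string(kv.Key))
	}
}
```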
2. 集群数据
2.1. node
/registry/minions/<node-ip-1>
/registry/minions/<node-ip-2>
/registry/minions/<node-ip-3>
其他信息:
/registry/leases/kube-node-lease/<node-ip-1>
/registry/leases/kube-node-lease/<node-ip-2>
/registry/leases/kube-node-lease/<node-ip-3>
/registry/masterleases/<node-ip-2>
/registry/masterleases/<node-ip-3>
3. k8s对象数据
k8s对象数据的格式
3.1. namespace
/registry/namespaces/default
/registry/namespaces/game
/registry/namespaces/kube-node-lease
/registry/namespaces/kube-public
/registry/namespaces/kube-system
3.2. namespace级别对象
/registry/{resource}/{namespace}/{resource_name}
以下以常见k8s对象为例:
# deployment
/registry/deployments/default/game-2048
/registry/deployments/kube-system/prometheus-operator
# replicasets
/registry/replicasets/default/game-2048-c7d589ccf
# pod
/registry/pods/default/game-2048-c7d589ccf-8lsbw
# statefulsets
/registry/statefulsets/kube-system/prometheus-k8s
# daemonsets
/registry/daemonsets/kube-system/kube-proxy
# secrets
/registry/secrets/default/default-token-tbfmb
# serviceaccounts
/registry/serviceaccounts/default/default
service
# service
/registry/services/specs/default/game-2048
# endpoints
/registry/services/endpoints/default/game-2048
4. 读取数据value
由于k8s默认在etcd中以protobuf格式存储数据，因此直接读取时看到的key和value是经过编码的字符串。
alias ectl='ETCDCTL_API=3 etcdctl --endpoints=<etcd-ip-1>:2379,<etcd-ip-2>:2379,<etcd-ip-3>:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --key=/etc/kubernetes/pki/apiserver-etcd-client.key --cert=/etc/kubernetes/pki/apiserver-etcd-client.crt'
# ectl get /registry/namespaces/test -w json |jq
{
"header": {
"cluster_id": 12113422651334595000,
"member_id": 8381627376898157000,
"revision": 12321629,
"raft_term": 20
},
"kvs": [
{
"key": "L3JlZ2lzdHJ5L25hbWVzcGFjZXMvdGVzdA==",
"create_revision": 11670741,
"mod_revision": 11670741,
"version": 1,
"value": "azhzAAoPCgJ2MRIJTmFtZXNwYWNlElwKQgoEdGVzdBIAGgAiACokYWM1YmJjOTQtNTkxZi0xMWVhLWJiOTQtNmM5MmJmM2I3NmI1MgA4AEIICJuf3fIFEAB6ABIMCgprdWJlcm5ldGVzGggKBkFjdGl2ZRoAIgA="
}
],
"count": 1
}
其中key可以通过base64解码出来
echo "L3JlZ2lzdHJ5L25hbWVzcGFjZXMvdGVzdA==" | base64 --decode
# output
/registry/namespaces/test
value的值可以通过安装etcdhelper工具解析出来。
alias ehelper='etcdhelper -key /etc/kubernetes/pki/apiserver-etcd-client.key -cert /etc/kubernetes/pki/apiserver-etcd-client.crt -cacert /etc/kubernetes/pki/etcd/ca.crt'
# ehelper get /registry/namespaces/test
/v1, Kind=Namespace
{
"kind": "Namespace",
"apiVersion": "v1",
"metadata": {
"name": "test",
"uid": "ac5bbc94-591f-11ea-bb94-6c92bf3b76b5",
"creationTimestamp": "2020-02-27T05:11:55Z"
},
"spec": {
"finalizers": [
"kubernetes"
]
},
"status": {
"phase": "Active"
}
}
5. 注意事项
- 由于k8s的etcd数据为了性能考虑，默认通过protobuf格式存储，不要通过手动的方式去修改或添加k8s数据。
- 不推荐使用json格式存储etcd数据，如果需要json格式，可以使用--storage-media-type=application/json参数存储，参考：https://github.com/kubernetes/kubernetes/issues/44670
6. 快捷命令
由于etcdctl的命令需要添加很多认证参数和endpoints的参数,因此可以使用别名的方式来简化命令。
# etcdctl
alias ectl='ETCDCTL_API=3 etcdctl --endpoints=<etcd-ip-1>:2379,<etcd-ip-2>:2379,<etcd-ip-3>:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --key=/etc/kubernetes/pki/apiserver-etcd-client.key --cert=/etc/kubernetes/pki/apiserver-etcd-client.crt'
# etcdhelper
alias ehelper='etcdhelper -key /etc/kubernetes/pki/apiserver-etcd-client.key -cert /etc/kubernetes/pki/apiserver-etcd-client.crt -cacert /etc/kubernetes/pki/etcd/ca.crt'
6.1. etcdhelper的使用
etcdhelper文档参考:https://github.com/openshift/origin/tree/master/tools/etcdhelper
# 必要的认证参数
-key - points to master.etcd-client.key
-cert - points to master.etcd-client.crt
-cacert - points to ca.crt
# 命令操作参数
ls - list all keys starting with prefix
get - get the specific value of a key
dump - dump the entire contents of the etcd
示例
$ ehelper ls /registry/leases/
/registry/leases/kube-node-lease/<ip-1>
/registry/leases/kube-node-lease/<ip-2>
/registry/leases/kube-node-lease/<ip-3>
$ ehelper get <key>
7. RBAC
附RBAC相关的key。
clusterrolebindings
/registry/clusterrolebindings/cluster-admin
/registry/clusterrolebindings/flannel
/registry/clusterrolebindings/galaxy
/registry/clusterrolebindings/helm
/registry/clusterrolebindings/kube-state-metrics
/registry/clusterrolebindings/kubeadm:kubelet-bootstrap
/registry/clusterrolebindings/kubeadm:node-autoapprove-bootstrap
/registry/clusterrolebindings/kubeadm:node-autoapprove-certificate-rotation
/registry/clusterrolebindings/kubeadm:node-proxier
/registry/clusterrolebindings/lbcf-controller
/registry/clusterrolebindings/prometheus-k8s
/registry/clusterrolebindings/prometheus-operator
/registry/clusterrolebindings/system:aws-cloud-provider
/registry/clusterrolebindings/system:basic-user
/registry/clusterrolebindings/system:controller:attachdetach-controller
/registry/clusterrolebindings/system:controller:certificate-controller
/registry/clusterrolebindings/system:controller:clusterrole-aggregation-controller
/registry/clusterrolebindings/system:controller:cronjob-controller
/registry/clusterrolebindings/system:controller:daemon-set-controller
/registry/clusterrolebindings/system:controller:deployment-controller
/registry/clusterrolebindings/system:controller:disruption-controller
/registry/clusterrolebindings/system:controller:endpoint-controller
/registry/clusterrolebindings/system:controller:expand-controller
/registry/clusterrolebindings/system:controller:generic-garbage-collector
/registry/clusterrolebindings/system:controller:horizontal-pod-autoscaler
/registry/clusterrolebindings/system:controller:job-controller
/registry/clusterrolebindings/system:controller:namespace-controller
/registry/clusterrolebindings/system:controller:node-controller
/registry/clusterrolebindings/system:controller:persistent-volume-binder
/registry/clusterrolebindings/system:controller:pod-garbage-collector
/registry/clusterrolebindings/system:controller:pv-protection-controller
/registry/clusterrolebindings/system:controller:pvc-protection-controller
/registry/clusterrolebindings/system:controller:replicaset-controller
/registry/clusterrolebindings/system:controller:replication-controller
/registry/clusterrolebindings/system:controller:resourcequota-controller
/registry/clusterrolebindings/system:controller:route-controller
/registry/clusterrolebindings/system:controller:service-account-controller
/registry/clusterrolebindings/system:controller:service-controller
/registry/clusterrolebindings/system:controller:statefulset-controller
/registry/clusterrolebindings/system:controller:ttl-controller
/registry/clusterrolebindings/system:coredns
/registry/clusterrolebindings/system:discovery
/registry/clusterrolebindings/system:kube-controller-manager
/registry/clusterrolebindings/system:kube-dns
/registry/clusterrolebindings/system:kube-scheduler
/registry/clusterrolebindings/system:node
/registry/clusterrolebindings/system:node-proxier
/registry/clusterrolebindings/system:public-info-viewer
/registry/clusterrolebindings/system:volume-scheduler
clusterroles
/registry/clusterroles/admin
/registry/clusterroles/cluster-admin
/registry/clusterroles/edit
/registry/clusterroles/flannel
/registry/clusterroles/kube-state-metrics
/registry/clusterroles/lbcf-controller
/registry/clusterroles/prometheus-k8s
/registry/clusterroles/prometheus-operator
/registry/clusterroles/system:aggregate-to-admin
/registry/clusterroles/system:aggregate-to-edit
/registry/clusterroles/system:aggregate-to-view
/registry/clusterroles/system:auth-delegator
/registry/clusterroles/system:aws-cloud-provider
/registry/clusterroles/system:basic-user
/registry/clusterroles/system:certificates.k8s.io:certificatesigningrequests:nodeclient
/registry/clusterroles/system:certificates.k8s.io:certificatesigningrequests:selfnodeclient
/registry/clusterroles/system:controller:attachdetach-controller
/registry/clusterroles/system:controller:certificate-controller
/registry/clusterroles/system:controller:clusterrole-aggregation-controller
/registry/clusterroles/system:controller:cronjob-controller
/registry/clusterroles/system:controller:daemon-set-controller
/registry/clusterroles/system:controller:deployment-controller
/registry/clusterroles/system:controller:disruption-controller
/registry/clusterroles/system:controller:endpoint-controller
/registry/clusterroles/system:controller:expand-controller
/registry/clusterroles/system:controller:generic-garbage-collector
/registry/clusterroles/system:controller:horizontal-pod-autoscaler
/registry/clusterroles/system:controller:job-controller
/registry/clusterroles/system:controller:namespace-controller
/registry/clusterroles/system:controller:node-controller
/registry/clusterroles/system:controller:persistent-volume-binder
/registry/clusterroles/system:controller:pod-garbage-collector
/registry/clusterroles/system:controller:pv-protection-controller
/registry/clusterroles/system:controller:pvc-protection-controller
/registry/clusterroles/system:controller:replicaset-controller
/registry/clusterroles/system:controller:replication-controller
/registry/clusterroles/system:controller:resourcequota-controller
/registry/clusterroles/system:controller:route-controller
/registry/clusterroles/system:controller:service-account-controller
/registry/clusterroles/system:controller:service-controller
/registry/clusterroles/system:controller:statefulset-controller
/registry/clusterroles/system:controller:ttl-controller
/registry/clusterroles/system:coredns
/registry/clusterroles/system:csi-external-attacher
/registry/clusterroles/system:csi-external-provisioner
/registry/clusterroles/system:discovery
/registry/clusterroles/system:heapster
/registry/clusterroles/system:kube-aggregator
/registry/clusterroles/system:kube-controller-manager
/registry/clusterroles/system:kube-dns
/registry/clusterroles/system:kube-scheduler
/registry/clusterroles/system:kubelet-api-admin
/registry/clusterroles/system:node
/registry/clusterroles/system:node-bootstrapper
/registry/clusterroles/system:node-problem-detector
/registry/clusterroles/system:node-proxier
/registry/clusterroles/system:persistent-volume-provisioner
/registry/clusterroles/system:public-info-viewer
/registry/clusterroles/system:volume-scheduler
/registry/clusterroles/view
rolebindings
/registry/rolebindings/kube-public/kubeadm:bootstrap-signer-clusterinfo
/registry/rolebindings/kube-public/system:controller:bootstrap-signer
/registry/rolebindings/kube-system/kube-proxy
/registry/rolebindings/kube-system/kube-state-metrics
/registry/rolebindings/kube-system/kubeadm:kubeadm-certs
/registry/rolebindings/kube-system/kubeadm:kubelet-config-1.14
/registry/rolebindings/kube-system/kubeadm:nodes-kubeadm-config
/registry/rolebindings/kube-system/system::extension-apiserver-authentication-reader
/registry/rolebindings/kube-system/system::leader-locking-kube-controller-manager
/registry/rolebindings/kube-system/system::leader-locking-kube-scheduler
/registry/rolebindings/kube-system/system:controller:bootstrap-signer
/registry/rolebindings/kube-system/system:controller:cloud-provider
/registry/rolebindings/kube-system/system:controller:token-cleaner
roles
/registry/roles/kube-public/kubeadm:bootstrap-signer-clusterinfo
/registry/roles/kube-public/system:controller:bootstrap-signer
/registry/roles/kube-system/extension-apiserver-authentication-reader
/registry/roles/kube-system/kube-proxy
/registry/roles/kube-system/kube-state-metrics-resizer
/registry/roles/kube-system/kubeadm:kubeadm-certs
/registry/roles/kube-system/kubeadm:kubelet-config-1.14
/registry/roles/kube-system/kubeadm:nodes-kubeadm-config
/registry/roles/kube-system/system::leader-locking-kube-controller-manager
/registry/roles/kube-system/system::leader-locking-kube-scheduler
/registry/roles/kube-system/system:controller:bootstrap-signer
/registry/roles/kube-system/system:controller:cloud-provider
/registry/roles/kube-system/system:controller:token-cleaner
参考:
13.7 - etcd-operator的使用
本文主要介绍etcd-operator的部署及使用
1. 部署RBAC
下载create_role.sh、cluster-role-binding-template.yaml、cluster-role-template.yaml
例如:
|-- cluster-role-binding-template.yaml
|-- cluster-role-template.yaml
|-- create_role.sh
# 部署rbac
kubectl create ns operator
bash create_role.sh --namespace=operator # namespace与etcd-operator的ns一致
示例:
bash create_role.sh --namespace=operator
+ ROLE_NAME=etcd-operator
+ ROLE_BINDING_NAME=etcd-operator
+ NAMESPACE=default
+ for i in '"$@"'
+ case $i in
+ NAMESPACE=operator
+ echo 'Creating role with ROLE_NAME=etcd-operator, NAMESPACE=operator'
Creating role with ROLE_NAME=etcd-operator, NAMESPACE=operator
+ sed -e 's/<ROLE_NAME>/etcd-operator/g' -e 's/<NAMESPACE>/operator/g' cluster-role-template.yaml
+ kubectl create -f -
clusterrole.rbac.authorization.k8s.io/etcd-operator created
+ echo 'Creating role binding with ROLE_NAME=etcd-operator, ROLE_BINDING_NAME=etcd-operator, NAMESPACE=operator'
Creating role binding with ROLE_NAME=etcd-operator, ROLE_BINDING_NAME=etcd-operator, NAMESPACE=operator
+ sed -e 's/<ROLE_NAME>/etcd-operator/g' -e 's/<ROLE_BINDING_NAME>/etcd-operator/g' -e 's/<NAMESPACE>/operator/g' cluster-role-binding-template.yaml
+ kubectl create -f -
clusterrolebinding.rbac.authorization.k8s.io/etcd-operator created
1.1. create_role.sh 脚本
create_role.sh有三个入参,可以指定--namespace参数,该参数与etcd-operator部署的namespace应一致。默认为default。
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail
ETCD_OPERATOR_ROOT=$(dirname "${BASH_SOURCE}")/../..
print_usage() {
echo "$(basename "$0") - Create Kubernetes RBAC role and role bindings for etcd-operator
Usage: $(basename "$0") [options...]
Options:
--role-name=STRING Name of ClusterRole to create
(default=\"etcd-operator\", environment variable: ROLE_NAME)
--role-binding-name=STRING Name of ClusterRoleBinding to create
(default=\"etcd-operator\", environment variable: ROLE_BINDING_NAME)
--namespace=STRING namespace to create role and role binding in. Must already exist.
(default=\"default\", environment variable: NAMESPACE)
" >&2
}
ROLE_NAME="${ROLE_NAME:-etcd-operator}"
ROLE_BINDING_NAME="${ROLE_BINDING_NAME:-etcd-operator}"
NAMESPACE="${NAMESPACE:-default}"
for i in "$@"
do
case $i in
--role-name=*)
ROLE_NAME="${i#*=}"
;;
--role-binding-name=*)
ROLE_BINDING_NAME="${i#*=}"
;;
--namespace=*)
NAMESPACE="${i#*=}"
;;
-h|--help)
print_usage
exit 0
;;
*)
print_usage
exit 1
;;
esac
done
echo "Creating role with ROLE_NAME=${ROLE_NAME}, NAMESPACE=${NAMESPACE}"
sed -e "s/<ROLE_NAME>/${ROLE_NAME}/g" \
-e "s/<NAMESPACE>/${NAMESPACE}/g" \
"cluster-role-template.yaml" | \
kubectl create -f -
echo "Creating role binding with ROLE_NAME=${ROLE_NAME}, ROLE_BINDING_NAME=${ROLE_BINDING_NAME}, NAMESPACE=${NAMESPACE}"
sed -e "s/<ROLE_NAME>/${ROLE_NAME}/g" \
-e "s/<ROLE_BINDING_NAME>/${ROLE_BINDING_NAME}/g" \
-e "s/<NAMESPACE>/${NAMESPACE}/g" \
"cluster-role-binding-template.yaml" | \
kubectl create -f -
1.2. cluster-role-binding-template.yaml
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
name: <ROLE_BINDING_NAME>
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: <ROLE_NAME>
subjects:
- kind: ServiceAccount
name: default
namespace: <NAMESPACE>
1.3. cluster-role-template.yaml
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
name: <ROLE_NAME>
rules:
- apiGroups:
- etcd.database.coreos.com
resources:
- etcdclusters
- etcdbackups
- etcdrestores
verbs:
- "*"
- apiGroups:
- apiextensions.k8s.io
resources:
- customresourcedefinitions
verbs:
- "*"
- apiGroups:
- ""
resources:
- pods
- services
- endpoints
- persistentvolumeclaims
- events
verbs:
- "*"
- apiGroups:
- apps
resources:
- deployments
verbs:
- "*"
# The following permissions can be removed if not using S3 backup and TLS
- apiGroups:
- ""
resources:
- secrets
verbs:
- get
2. 部署etcd-operator
kubectl create -f etcd-operator.yaml
etcd-operator.yaml如下:
apiVersion: apps/v1
kind: Deployment
metadata:
name: etcd-operator
namespace: operator # 与rbac指定的ns一致
labels:
app: etcd-operator
spec:
replicas: 1
selector:
matchLabels:
app: etcd-operator
template:
metadata:
labels:
app: etcd-operator
spec:
containers:
- name: etcd-operator
image: registry.cn-shenzhen.aliyuncs.com/huweihuang/etcd-operator:v0.9.4
command:
- etcd-operator
# Uncomment to act for resources in all namespaces. More information in doc/user/clusterwide.md
- -cluster-wide
env:
- name: MY_POD_NAMESPACE
valueFrom:
fieldRef:
fieldPath: metadata.namespace
- name: MY_POD_NAME
valueFrom:
fieldRef:
fieldPath: metadata.name
查看CRD
#kubectl get customresourcedefinitions
NAME CREATED AT
etcdclusters.etcd.database.coreos.com 2020-08-01T13:02:18Z
查看etcd-operator的日志是否OK。
k logs -f etcd-operator-545df8d445-qpf6n -n operator
time="2020-08-01T13:02:18Z" level=info msg="etcd-operator Version: 0.9.4"
time="2020-08-01T13:02:18Z" level=info msg="Git SHA: c8a1c64"
time="2020-08-01T13:02:18Z" level=info msg="Go Version: go1.11.5"
time="2020-08-01T13:02:18Z" level=info msg="Go OS/Arch: linux/amd64"
time="2020-08-01T13:02:18Z" level=info msg="Event(v1.ObjectReference{Kind:\"Endpoints\", Namespace:\"operator\", Name:\"etcd-operator\", UID:\"7de38cff-1b7b-4bf2-9837-473fa66c9366\", APIVersion:\"v1\", ResourceVersion:\"41195930\", FieldPath:\"\"}): type: 'Normal' reason: 'LeaderElection' etcd-operator-545df8d445-qpf6n became leader"
以上内容表示etcd-operator运行正常。
3. 部署etcd集群
kubectl create -f etcd-cluster.yaml
当开启clusterwide模式时，etcd集群与etcd-operator的namespace可以不同。
etcd-cluster.yaml
apiVersion: "etcd.database.coreos.com/v1beta2"
kind: "EtcdCluster"
metadata:
name: "default-etcd-cluster"
## Adding this annotation make this cluster managed by clusterwide operators
## namespaced operators ignore it
annotations:
etcd.database.coreos.com/scope: clusterwide
namespace: etcd # 此处的ns表示etcd集群部署在哪个ns下
spec:
size: 3
version: "v3.3.18"
repository: registry.cn-shenzhen.aliyuncs.com/huweihuang/etcd
pod:
busyboxImage: registry.cn-shenzhen.aliyuncs.com/huweihuang/busybox:1.28.0-glibc
查看集群部署结果
$ kgpo -n etcd
NAME READY STATUS RESTARTS AGE
default-etcd-cluster-b6phnpf8z8 1/1 Running 0 3m3s
default-etcd-cluster-hhgq4sbtgr 1/1 Running 0 109s
default-etcd-cluster-ttfh5fj92b 1/1 Running 0 2m29s
4. 访问etcd集群
查看service
$ kgsvc -n etcd
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
default-etcd-cluster ClusterIP None <none> 2379/TCP,2380/TCP 5m37s
default-etcd-cluster-client ClusterIP 192.168.255.244 <none> 2379/TCP 5m37s
使用service地址访问
# 查看集群健康状态
$ ETCDCTL_API=3 etcdctl --endpoints 192.168.255.244:2379 endpoint health
192.168.255.244:2379 is healthy: successfully committed proposal: took = 1.96126ms
# 写数据
$ ETCDCTL_API=3 etcdctl --endpoints 192.168.255.244:2379 put foo bar
OK
# 读数据
$ ETCDCTL_API=3 etcdctl --endpoints 192.168.255.244:2379 get foo
foo
bar
5. 销毁etcd-operator
kubectl delete -f etcd-operator.yaml
kubectl delete endpoints etcd-operator
kubectl delete crd etcdclusters.etcd.database.coreos.com
kubectl delete clusterrole etcd-operator
kubectl delete clusterrolebinding etcd-operator
参考:
- https://github.com/coreos/etcd-operator
- https://github.com/coreos/etcd-operator/blob/master/doc/user/install_guide.md
- https://github.com/coreos/etcd-operator/blob/master/doc/user/client_service.md
- https://github.com/coreos/etcd-operator/blob/master/doc/user/spec_examples.md
- https://github.com/coreos/etcd-operator/blob/master/doc/user/cluster_tls.md
14 - 多集群管理
14.1 - k8s多集群管理的思考
k8s多集群的思考
1. 为什么需要多集群
1、k8s单集群的承载能力有限。
Kubernetes v1.21 支持的最大节点数为 5000。 更具体地说,Kubernetes旨在适应满足以下所有标准的配置:
- 每个节点的 Pod 数量不超过 100
- 节点数不超过 5000
- Pod 总数不超过 150000
- 容器总数不超过 300000
参考:https://kubernetes.io/zh/docs/setup/best-practices/cluster-large/
且当节点数量较大时,会出现调度延迟,etcd读写延迟,apiserver负载高等问题,影响服务的正常创建。
2、分散集群服务风险。
全部服务都放在一个k8s集群中，当该集群出现异常且短期无法恢复时，会影响全部服务的运行和部署。为了避免机房等故障导致单集群异常，建议将k8s的master分散部署在延迟较低的不同可用区，并在不同region部署多个k8s集群来进行集群级别的容灾。
3、当前混合云的使用方式和架构
当前部分公司存在自建机房+不同云厂商公有云的混合云运营模式，自然会引入多集群管理的问题。
2. 多集群部署需要解决哪些问题
目标:让用户像使用单集群一样来使用多集群。
扩展集群的边界：服务的边界从单台物理机上的多个进程，发展到通过k8s集群管理多台物理机，再发展到管理多个k8s集群，即服务的边界从物理机扩展到了集群。
而多集群管理需要解决以下问题:
- 多集群服务的分发部署(deployment、daemonset等)
- 跨集群自动迁移与调度(当某个集群异常,服务可以在其他集群自动部署)
- 多集群服务发现,网络通信及负载均衡(service,ingress等)
而多集群服务的网络通信可以由Service mesh等来解决,本文不做重点讨论。
以上几个问题，可以先类比k8s管理节点的思路进行分析：
| | 物理机视角 | 单集群视角 | 多集群视角 |
|---|---|---|---|
| 进程的边界 | 物理机 | k8s集群 | 多集群 |
| 调度单元 | 进程或线程 | 容器或pod | 工作负载(deployment) |
| 服务的集合 | | 工作负载(deployment) | 不同集群工作负载的集合体(workloadGroup) |
| 服务发现 | | service | 不同集群service的集合体 |
| 服务迁移 | | 工作负载(deployment)控制器 | 不同集群工作负载的集合体控制器 |
| 服务调度 | | nodename或者node selector | clustername或cluster selector |
| 反亲和 | | pod的反亲和(相同deployment下的pod不调度在相同节点) | workload反亲和(相同workloadGroup分散在不同集群) |
2.1. 多集群工作负载的分发
单集群中k8s的调度单元是pod,即一个pod只能跑在一个节点上,一个节点可以运行多个pod,而不同节点上的一组pod是通过一个workload来控制和分发。类似这个逻辑,那么在多集群的视角下,多集群的调度单元是一个集群的workload,一个workload只能跑在一个集群中,一个集群可以运行多个workload。
那么就需要有一个控制器来管理不同k8s集群的相同workload。例如 workloadGroup。而该workloadGroup在不侵入k8s原生API的情况下,主要包含两个部分。
workloadGroup:
- 资源模板(Resource Template):服务的描述(workload)
- 分发策略(Propagation Policy)：服务分发的集群(即多个workload应该被分发到哪些集群运行)
workload描述的是什么服务运行在什么节点,workloadGroup描述的是什么服务运行在什么集群。
实现workloadGroup有两种方式:
- 一种是自定义API，将workloadGroup中的Resource Template和Propagation Policy合成在一个自定义的对象中，由用户直接指定该workloadGroup信息，从而将不同的workload分发到不同的集群中（类型定义示意见下方代码）。
- 另一种方式是通过一个k8s载体来记录一个具体的workload对象，再由用户指定Propagation Policy关联该workload对象，从而让控制器自动根据用户指定的Propagation Policy将workload分发到不同的集群中。
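下面给出第一种方式中自定义对象的一个Go类型示意（纯属假设的示例类型，并非任何现有项目的真实API，仅用于说明“资源模板+分发策略”两部分内容）：
```go
package v1alpha1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
)

// WorkloadGroup 是一个假设的CRD类型，把"资源模板"和"分发策略"合成在一个对象中。
type WorkloadGroup struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec WorkloadGroupSpec `json:"spec"`
}

type WorkloadGroupSpec struct {
	// ResourceTemplate 内嵌一个原生的workload描述(例如一个Deployment的完整定义)。
	ResourceTemplate runtime.RawExtension `json:"resourceTemplate"`

	// Propagation 描述该workload应被分发到哪些集群。
	Propagation PropagationPolicy `json:"propagation"`
}

type PropagationPolicy struct {
	// ClusterNames 直接指定目标集群，类似单集群调度里的nodeName。
	ClusterNames []string `json:"clusterNames,omitempty"`

	// ClusterSelector 按标签选择目标集群，类似单集群调度里的nodeSelector。
	ClusterSelector map[string]string `json:"clusterSelector,omitempty"`
}
```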
2.2. 跨集群自动迁移与调度
单集群中，k8s通过workload中的nodeSelector、nodeName以及亲和性来控制pod运行在哪个节点上。而在多集群视角下，则需要有一个控制器来实现集群级别的调度逻辑，例如clusterName、cluster selector、cluster AntiAffinity，从而自动控制workloadGroup下的workload分散在哪些集群上（见下方的示意代码）。
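类似单集群中先过滤节点再选择的调度过程，集群级调度也可以先按cluster selector过滤、再排除异常集群。下面是一个极简的Go示意（数据结构与函数均为假设，仅说明思路）：
```go
package main

import "fmt"

// Cluster 描述一个可调度的成员集群(假设性的数据结构)。
type Cluster struct {
	Name    string
	Labels  map[string]string
	Healthy bool
}

// pickClusters 按"标签匹配 + 排除异常集群"挑选目标集群，
// 类似单集群调度中的nodeSelector加节点健康检查。
func pickClusters(clusters []Cluster, selector map[string]string) []string {
	var targets []string
	for _, c := range clusters {
		if !c.Healthy {
			continue // 集群异常时跳过，workload会被重新分发(迁移)到其余集群
		}
		matched := true
		for k, v := range selector {
			if c.Labels[k] != v {
				matched = false
				break
			}
		}
		if matched {
			targets = append(targets, c.Name)
		}
	}
	return targets
}

func main() {
	clusters := []Cluster{
		{Name: "cluster-bj", Labels: map[string]string{"region": "beijing"}, Healthy: true},
		{Name: "cluster-sh", Labels: map[string]string{"region": "shanghai"}, Healthy: false},
		{Name: "cluster-gz", Labels: map[string]string{"region": "guangzhou"}, Healthy: true},
	}
	fmt.Println(pickClusters(clusters, nil)) // 输出: [cluster-bj cluster-gz]
}
```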
3. 目前的多集群方案
3.1. Kubefed[Federation v2]
简介
基本思想
3.2. virtual kubelet
简介
基本思想
3.3. Karmada
简介
基本思想
参考:
- https://kubernetes.io/zh/docs/setup/best-practices/cluster-large/
- CoreOS 是如何将 Kubernetes 的性能提高 10 倍的?
- 当 K8s 集群达到万级规模,阿里巴巴如何解决系统各组件性能问题?
- https://github.com/kubernetes-sigs/kubefed
- https://jimmysong.io/kubernetes-handbook/practice/federation.html
- https://kubernetes.io/blog/2018/12/12/kubernetes-federation-evolution/
- https://zhuanlan.zhihu.com/p/355193315
14.2 - Virtual Kubelet
14.2.1 - Virtual Kubelet介绍
1. 简介
Virtual Kubelet是 Kubernetes kubelet 的一种实现,作为一种虚拟的kubelet用来连接k8s集群和其他平台的API。这允许k8s的节点由其他提供者(provider)提供支持,这些提供者例如serverless平台(ACI, AWS Fargate)、IoT Edge等。
一句话概括:Kubernetes API on top, programmable back。
2. 架构图
3. 功能
virtual kubelet提供一个可以自定义k8s node的依赖库。
目前支持的功能如下:
- 创建、删除、更新 pod
- 容器的日志、exec命令、metrics
- 获取pod、pod列表、pod status
- node的地址、容量、daemon endpoints
- 操作系统
- 自定义virtual network
4. Providers
virtual kubelet提供一个插件式的provider接口,让开发者可以自定义实现传统kubelet的功能。自定义的provider可以用自己的配置文件和环境参数。
自定义的provider必须提供以下功能:
- 提供pod、容器、资源的生命周期管理的功能
- 符合virtual kubelet提供的API
- 不直接访问k8s apiserver,定义获取数据的回调机制,例如configmap、secrets
开源的provider
5. 自定义provider
创建自定义provider的目录。
git clone https://github.com/virtual-kubelet/virtual-kubelet
cd virtual-kubelet
mkdir providers/my-provider
5.1. PodLifecycleHandler
当pod被k8s创建、更新、删除时,会调用以下方法。
type PodLifecycleHandler interface {
// CreatePod takes a Kubernetes Pod and deploys it within the provider.
CreatePod(ctx context.Context, pod *corev1.Pod) error
// UpdatePod takes a Kubernetes Pod and updates it within the provider.
UpdatePod(ctx context.Context, pod *corev1.Pod) error
// DeletePod takes a Kubernetes Pod and deletes it from the provider.
DeletePod(ctx context.Context, pod *corev1.Pod) error
// GetPod retrieves a pod by name from the provider (can be cached).
GetPod(ctx context.Context, namespace, name string) (*corev1.Pod, error)
// GetPodStatus retrieves the status of a pod by name from the provider.
GetPodStatus(ctx context.Context, namespace, name string) (*corev1.PodStatus, error)
// GetPods retrieves a list of all pods running on the provider (can be cached).
GetPods(context.Context) ([]*corev1.Pod, error)
}
PodLifecycleHandler由PodController调用，用来管理被分配到该node上的pod。
pc, _ := node.NewPodController(podControllerConfig) // <-- instantiates the pod controller
pc.Run(ctx) // <-- starts watching for pods to be scheduled on the node
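下面给出一个仅把pod保存在内存中的最小PodLifecycleHandler实现示意（假设性示例，不会真正运行容器，仅演示需要实现哪些方法）：
```go
package myprovider

import (
	"context"
	"fmt"
	"sync"

	corev1 "k8s.io/api/core/v1"
)

// inMemoryProvider 把"创建"的pod保存在内存map中，仅用于演示接口的实现方式。
type inMemoryProvider struct {
	mu   sync.Mutex
	pods map[string]*corev1.Pod // key: namespace/name
}

func newInMemoryProvider() *inMemoryProvider {
	return &inMemoryProvider{pods: map[string]*corev1.Pod{}}
}

func podKey(namespace, name string) string { return namespace + "/" + name }

func (p *inMemoryProvider) CreatePod(ctx context.Context, pod *corev1.Pod) error {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.pods[podKey(pod.Namespace, pod.Name)] = pod
	return nil
}

func (p *inMemoryProvider) UpdatePod(ctx context.Context, pod *corev1.Pod) error {
	return p.CreatePod(ctx, pod) // 示例中更新与创建等价
}

func (p *inMemoryProvider) DeletePod(ctx context.Context, pod *corev1.Pod) error {
	p.mu.Lock()
	defer p.mu.Unlock()
	delete(p.pods, podKey(pod.Namespace, pod.Name))
	return nil
}

func (p *inMemoryProvider) GetPod(ctx context.Context, namespace, name string) (*corev1.Pod, error) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if pod, ok := p.pods[podKey(namespace, name)]; ok {
		return pod, nil
	}
	return nil, fmt.Errorf("pod %s/%s not found", namespace, name)
}

func (p *inMemoryProvider) GetPodStatus(ctx context.Context, namespace, name string) (*corev1.PodStatus, error) {
	pod, err := p.GetPod(ctx, namespace, name)
	if err != nil {
		return nil, err
	}
	return &pod.Status, nil
}

func (p *inMemoryProvider) GetPods(ctx context.Context) ([]*corev1.Pod, error) {
	p.mu.Lock()
	defer p.mu.Unlock()
	pods := make([]*corev1.Pod, 0, len(p.pods))
	for _, pod := range p.pods {
		pods = append(pods, pod)
	}
	return pods, nil
}
```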
5.2. PodNotifier(optional)
PodNotifier是可选实现,该接口主要用来通知virtual kubelet的pod状态变化。如果没有实现该接口,virtual-kubelet会定期检查所有pod的状态。
type PodNotifier interface {
// NotifyPods instructs the notifier to call the passed in function when
// the pod status changes.
//
// NotifyPods should not block callers.
NotifyPods(context.Context, func(*corev1.Pod))
}
5.3. NodeProvider
NodeProvider用来通知virtual-kubelet关于node状态的变化，virtual-kubelet会定期检查node的状态并相应地更新k8s。
type NodeProvider interface {
// Ping checks if the node is still active.
// This is intended to be lightweight as it will be called periodically as a
// heartbeat to keep the node marked as ready in Kubernetes.
Ping(context.Context) error
// NotifyNodeStatus is used to asynchronously monitor the node.
// The passed in callback should be called any time there is a change to the
// node's status.
// This will generally trigger a call to the Kubernetes API server to update
// the status.
//
// NotifyNodeStatus should not block callers.
NotifyNodeStatus(ctx context.Context, cb func(*corev1.Node))
}
NodeProvider是被NodeController调用,来管理k8s中的node对象。
nc, _ := node.NewNodeController(nodeProvider, nodeSpec) // <-- instantiate a node controller from a node provider and a kubernetes node spec
nc.Run(ctx) // <-- creates the node in kubernetes and starts up the controller
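下面是一个最简的NodeProvider实现示意（假设性示例）：节点始终视为存活，且不主动上报状态变化，由virtual-kubelet按周期自行同步。
```go
package myprovider

import (
	"context"

	corev1 "k8s.io/api/core/v1"
)

// staticNodeProvider 是一个演示用的NodeProvider实现：节点状态固定不变。
type staticNodeProvider struct{}

// Ping 作为心跳探测，这里只在context被取消时返回错误，表示节点始终存活。
func (staticNodeProvider) Ping(ctx context.Context) error {
	return ctx.Err()
}

// NotifyNodeStatus 注册节点状态变更回调；本示例节点状态不变化，因此不调用回调。
func (staticNodeProvider) NotifyNodeStatus(ctx context.Context, cb func(*corev1.Node)) {}
```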
5.4. 测试
进入到项目根目录
make test
5.5. 示例代码
-
Azure Container Instances Provider
https://github.com/virtual-kubelet/azure-aci/blob/master/aci.go#L541
-
Alibaba Cloud ECI Provider
https://github.com/virtual-kubelet/alibabacloud-eci/blob/master/eci.go#L177
-
AWS Fargate Provider
https://github.com/virtual-kubelet/aws-fargate/blob/master/provider.go#L110
参考:
14.2.2 - Virtual Kubelet命令
virtual-kubelet --help
#./virtual-kubelet --help
virtual-kubelet implements the Kubelet interface with a pluggable
backend implementation allowing users to create kubernetes nodes without running the kubelet.
This allows users to schedule kubernetes workloads on nodes that aren't running Kubernetes.
Usage:
virtual-kubelet [flags]
virtual-kubelet [command]
Available Commands:
help Help about any command
providers Show the list of supported providers
version Show the version of the program
Flags:
--cluster-domain string kubernetes cluster-domain (default is 'cluster.local') (default "cluster.local")
--disable-taint disable the virtual-kubelet node taint
--enable-node-lease use node leases (1.13) for node heartbeats
--full-resync-period duration how often to perform a full resync of pods between kubernetes and the provider (default 1m0s)
-h, --help help for virtual-kubelet
--klog.alsologtostderr log to standard error as well as files
--klog.log_backtrace_at traceLocation when logging hits line file:N, emit a stack trace (default :0)
--klog.log_dir string If non-empty, write log files in this directory
--klog.log_file string If non-empty, use this log file
--klog.log_file_max_size uint Defines the maximum size a log file can grow to. Unit is megabytes. If the value is 0, the maximum file size is unlimited. (default 1800)
--klog.logtostderr log to standard error instead of files (default true)
--klog.skip_headers If true, avoid header prefixes in the log messages
--klog.skip_log_headers If true, avoid headers when opening log files
--klog.stderrthreshold severity logs at or above this threshold go to stderr (default 2)
--klog.v Level number for the log level verbosity
--klog.vmodule moduleSpec comma-separated list of pattern=N settings for file-filtered logging
--kubeconfig string kube config file to use for connecting to the Kubernetes API server (default "/root/.kube/config")
--log-level string set the log level, e.g. "debug", "info", "warn", "error" (default "info")
--metrics-addr string address to listen for metrics/stats requests (default ":10255")
--namespace string kubernetes namespace (default is 'all')
--nodename string kubernetes node name (default "virtual-kubelet")
--os string Operating System (Linux/Windows) (default "Linux")
--pod-sync-workers int set the number of pod synchronization workers (default 10)
--provider string cloud provider
--provider-config string cloud provider configuration file
--startup-timeout duration How long to wait for the virtual-kubelet to start
--trace-exporter strings sets the tracing exporter to use, available exporters: [jaeger ocagent]
--trace-sample-rate string set probability of tracing samples
--trace-service-name string sets the name of the service used to register with the trace exporter (default "virtual-kubelet")
--trace-tag map add tags to include with traces in key=value form
Use "virtual-kubelet [command] --help" for more information about a command.
14.3 - Karmada
14.3.1 - Karmada介绍
本文由网络资源整理以作记录
简介
Karmada(Kubernetes Armada)是基于Kubernetes原生API的多集群管理系统。在多云和混合云场景下,Karmada提供可插拔,全自动化管理多集群应用,实现多云集中管理、高可用性、故障恢复和流量调度。
特性
- 基于K8s原生API的跨集群应用管理,用户可以方便快捷地将应用从单集群迁移到多集群。
- 中心式操作和管理Kubernetes集群。
- 跨集群应用可在多集群上自动扩展,故障转移和负载均衡。
- 高级的调度策略:区域,可用区,云提供商,集群亲和性/反亲和性。
- 支持创建分发用户自定义(CustomResourceDefinitions)资源。
框架结构

- ETCD:存储Karmada API对象。
- Karmada Scheduler:提供高级的多集群调度策略。
- Karmada Controller Manager: 包含多个Controller,Controller监听karmada对象并且与成员集群API server进行通信并创建成员集群的k8s对象。
- Cluster Controller:成员集群的生命周期管理与对象管理。
- Policy Controller:监听PropagationPolicy对象,创建ResourceBinding,配置资源分发策略。
- Binding Controller：监听ResourceBinding对象，并创建与之对应的work对象（包含资源清单）。
- Execution Controller:监听work对象,并将资源分发到成员集群中。
资源分发流程
基本概念
- 资源模板(Resource Template):Karmada使用K8s原生API定义作为资源模板,便于快速对接K8s生态工具链。
- 分发策略(Propagation Policy)：Karmada提供独立的策略API，用来配置资源分发策略。
- 差异化策略(Override Policy):Karmada提供独立的差异化API,用来配置与集群相关的差异化配置。比如配置不同集群使用不同的镜像。
Karmada资源分发流程图:

参考:
15 - 边缘容器
15.1 - KubeEdge
15.1.1 - KubeEdge介绍
1. KubeEdge简介
KubeEdge基于kubernetes，将容器化应用的编排能力拓展到边缘主机或边缘设备，在云端和边缘端之间提供网络通信、应用部署、元数据同步等功能。同时支持MQTT协议，允许开发者在边缘端自定义接入边缘设备。
2. 功能
- 边缘计算:提供边缘节点自治能力,边缘节点数据处理能力。
- 便捷部署:开发者可以开发http或mqtt协议的应用,运行在云端和边缘端。
- k8s原生支持:可以通过k8s管理和监控边缘设备和边缘节点。
- 丰富的应用类型:可以在边缘端部署机器学习、图片识别、事件处理等应用。
3. 组件
3.1. 云端
- CloudHub：一个web socket服务器，负责监听云端的更新、缓存及向EdgeHub发送消息。
- EdgeController：一个扩展的k8s控制器，负责管理边缘节点和pod元数据，同步边缘节点的数据，是k8s-apiserver与EdgeCore的通信桥梁。
- DeviceController：一个扩展的k8s控制器，负责管理节点设备，同步云端和边缘端的设备元数据和状态。
3.2. 边缘端
- EdgeHub:一个web socket客户端,负责云端与边缘端的信息交互,其中包括将云端的资源变更同步到边缘端及边缘端的状态变化同步到云端。
- Edged:运行在边缘节点,管理容器化应用的agent,负责pod生命周期的管理,类似kubelet。
- EventBus:一个MQTT客户端,与MQTT服务端交互,提供发布/订阅的能力。
- ServiceBus:一个HTTP客户端,与HTTP服务端交互。为云组件提供HTTP客户端功能,以访问在边缘运行的HTTP服务器。
- DeviceTwin:负责存储设备状态并同步设备状态到云端,同时提供应用的接口查询。
- MetaManager：edged和edgehub之间的消息处理器，负责向轻量级数据库(SQLite)存储或查询元数据。
4. 架构图

参考:
15.1.2 - KubeEdge源码分析
15.1.2.1 - kubeedge源码分析之cloudcore
本文源码分析基于kubeedge v1.1.0
本文主要分析cloudcore中CloudCoreCommand的基本流程,具体的cloudhub、edgecontroller、devicecontroller模块的实现逻辑待后续单独文章分析。
目录结构:
cloud/cmd/cloudcore
cloudcore
├── app
│ ├── options
│ │ └── options.go
│ └── server.go # NewCloudCoreCommand、registerModules
└── cloudcore.go # main函数
cloudcore部分包含以下模块:
- cloudhub
- edgecontroller
- devicecontroller
1. main函数
kubeedge的代码采用cobra命令框架,代码风格与k8s源码风格类似。cmd目录主要为cobra command的基本内容及参数解析,pkg目录包含具体的实现逻辑。
cloud/cmd/cloudcore/cloudcore.go
func main() {
command := app.NewCloudCoreCommand()
logs.InitLogs()
defer logs.FlushLogs()
if err := command.Execute(); err != nil {
os.Exit(1)
}
}
2. NewCloudCoreCommand
NewCloudCoreCommand为cobra command的构造函数,该类函数一般包含以下部分:
- 构造option
- 添加Flags
- 运行Run函数(核心)
cloud/cmd/cloudcore/app/server.go
func NewCloudCoreCommand() *cobra.Command {
	opts := options.NewCloudCoreOptions()
	cmd := &cobra.Command{
		Use: "cloudcore",
		Long: `CloudCore is the core cloud part of KubeEdge, which contains three modules: cloudhub,
edgecontroller, and devicecontroller. Cloudhub is a web server responsible for watching changes at the cloud side,
caching and sending messages to EdgeHub. EdgeController is an extended kubernetes controller which manages
edge nodes and pods metadata so that the data can be targeted to a specific edge node. DeviceController is an extended
kubernetes controller which manages devices so that the device metadata/status date can be synced between edge and cloud.`,
		Run: func(cmd *cobra.Command, args []string) {
			verflag.PrintAndExitIfRequested()
			flag.PrintFlags(cmd.Flags())
			// To help debugging, immediately log version
			klog.Infof("Version: %+v", version.Get())
			registerModules()
			// start all modules
			core.Run()
		},
	}
	fs := cmd.Flags()
	namedFs := opts.Flags()
	verflag.AddFlags(namedFs.FlagSet("global"))
	globalflag.AddGlobalFlags(namedFs.FlagSet("global"), cmd.Name())
	for _, f := range namedFs.FlagSets {
		fs.AddFlagSet(f)
	}
	usageFmt := "Usage:\n %s\n"
	cols, _, _ := term.TerminalSize(cmd.OutOrStdout())
	cmd.SetUsageFunc(func(cmd *cobra.Command) error {
		fmt.Fprintf(cmd.OutOrStderr(), usageFmt, cmd.UseLine())
		cliflag.PrintSections(cmd.OutOrStderr(), namedFs, cols)
		return nil
	})
	cmd.SetHelpFunc(func(cmd *cobra.Command, args []string) {
		fmt.Fprintf(cmd.OutOrStdout(), "%s\n\n"+usageFmt, cmd.Long, cmd.UseLine())
		cliflag.PrintSections(cmd.OutOrStdout(), namedFs, cols)
	})
	return cmd
}
核心代码:
// 构造option
opts := options.NewCloudCoreOptions()
// 执行run函数
registerModules()
core.Run()
// 添加flags
fs.AddFlagSet(f)
3. registerModules
由于kubeedge代码中的大部分模块都采用了基于go channel的消息通信框架Beehive(待后续单独文章分析),因此在各模块启动之前,需要先将模块注册到Beehive框架中。
其中cloudcore部分涉及的模块有:
- cloudhub
- edgecontroller
- devicecontroller
cloud/cmd/cloudcore/app/server.go
// registerModules register all the modules started in cloudcore
func registerModules() {
	cloudhub.Register()
	edgecontroller.Register()
	devicecontroller.Register()
}
以下以cloudhub为例说明注册的过程。
cloudhub结构体主要包含:
- context:上下文,用来传递消息上下文
- stopChan:go channel通信
beehive框架中的模块需要实现Module接口,因此cloudhub也实现了该接口,其中核心方法为Start,用来启动相应模块的运行。
vendor/github.com/kubeedge/beehive/pkg/core/module.go
// Module interface
type Module interface {
	Name() string
	Group() string
	Start(c *context.Context)
	Cleanup()
}
以下为cloudHub结构体及注册函数。
cloud/pkg/cloudhub/cloudhub.go
type cloudHub struct {
	context  *context.Context
	stopChan chan bool
}

func Register() {
	core.Register(&cloudHub{})
}
具体的注册实现函数为core.Register,注册过程实际上就是将具体的模块结构体放入一个以模块名为key的map映射中,待后续调用。
vendor/github.com/kubeedge/beehive/pkg/core/module.go
// Register register module
func Register(m Module) {
	if isModuleEnabled(m.Name()) {
		modules[m.Name()] = m // 将具体的模块结构体放入一个以模块名为key的map映射中
		log.LOGGER.Info("module " + m.Name() + " registered")
	} else {
		disabledModules[m.Name()] = m
		log.LOGGER.Info("module " + m.Name() +
			" is not register, please check modules.yaml")
	}
}
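为了更直观地理解注册与启动流程,下面给出一个简化的示意模块(并非kubeedge源码,模块名demo以及其中对beehive context的Receive调用均为基于上述接口的假设),演示一个模块如何实现Module接口并注册到beehive框架:

// 示意代码:一个最小化的beehive模块,仅用于说明Module接口的用法
type demoModule struct {
	context  *context.Context
	stopChan chan bool
}

func (d *demoModule) Name() string  { return "demo" }
func (d *demoModule) Group() string { return "demo" }

// Start由StartModules以goroutine方式调用
func (d *demoModule) Start(c *context.Context) {
	d.context = c
	d.stopChan = make(chan bool)
	for {
		select {
		case <-d.stopChan:
			return
		default:
		}
		// 假设通过beehive context接收发给本模块的消息并处理
		msg, err := d.context.Receive(d.Name())
		if err != nil {
			continue
		}
		_ = msg // 具体的业务处理逻辑
	}
}

// Cleanup在GracefulShutdown时被调用
func (d *demoModule) Cleanup() {
	close(d.stopChan)
}

// 注册方式与cloudHub的Register()相同
func Register() {
	core.Register(&demoModule{})
}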
4. core.Run
CloudCoreCommand命令的Run函数实际上是运行beehive框架中注册的所有模块。
其中包括两部分逻辑:
- 启动运行所有注册模块
- 监听信号并做优雅清理
vendor/github.com/kubeedge/beehive/pkg/core/core.go
//Run starts the modules and in the end does module cleanup
func Run() {
	//Address the module registration and start the core
	StartModules()
	// monitor system signal and shutdown gracefully
	GracefulShutdown()
}
5. StartModules
StartModules获取context上下文,并以goroutine的方式运行所有已注册的模块。其中Start即各模块对Module接口中Start方法的具体实现,不同模块各自定义自己的启动逻辑。
coreContext := context.GetContext(context.MsgCtxTypeChannel)
go module.Start(coreContext)
具体实现如下:
vendor/github.com/kubeedge/beehive/pkg/core/core.go
// StartModules starts modules that are registered
func StartModules() {
	coreContext := context.GetContext(context.MsgCtxTypeChannel)
	modules := GetModules()
	for name, module := range modules {
		//Init the module
		coreContext.AddModule(name)
		//Assemble typeChannels for send2Group
		coreContext.AddModuleGroup(name, module.Group())
		go module.Start(coreContext)
		log.LOGGER.Info("starting module " + name)
	}
}
6. GracefulShutdown
当收到相关信号,则执行各个模块实现的Cleanup方法。
vendor/github.com/kubeedge/beehive/pkg/core/core.go
// GracefulShutdown is if it gets the special signals it does modules cleanup
func GracefulShutdown() {
	c := make(chan os.Signal)
	signal.Notify(c, syscall.SIGINT, syscall.SIGHUP, syscall.SIGTERM,
		syscall.SIGQUIT, syscall.SIGILL, syscall.SIGTRAP, syscall.SIGABRT)
	select {
	case s := <-c:
		log.LOGGER.Info("got os signal " + s.String())
		//Cleanup each modules
		modules := GetModules()
		for name, module := range modules {
			log.LOGGER.Info("Cleanup module " + name)
			module.Cleanup()
		}
	}
}
15.1.2.2 - kubeedge源码分析之edgecore
本文源码分析基于kubeedge v1.1.0
本文主要分析edgecore中EdgeCoreCommand的基本流程,具体的edged、edgehub、metamanager等模块的实现逻辑待后续单独文章分析。
目录结构:
edgecore
├── app
│   ├── options
│   │   └── options.go
│   └── server.go       # NewEdgeCoreCommand、registerModules
└── edgecore.go         # main函数
edgecore模块包含:
- edged
- edgehub
- metamanager
- eventbus
- servicebus
- devicetwin
- edgemesh
1. main函数
main入口函数,仍然是cobra命令框架格式。
edge/cmd/edgecore/edgecore.go
func main() {
	command := app.NewEdgeCoreCommand()
	logs.InitLogs()
	defer logs.FlushLogs()
	if err := command.Execute(); err != nil {
		os.Exit(1)
	}
}
2. NewEdgeCoreCommand
NewEdgeCoreCommand与NewCloudCoreCommand一样构造对应的cobra command结构体。
edge/cmd/edgecore/app/server.go
// NewEdgeCoreCommand create edgecore cmd
func NewEdgeCoreCommand() *cobra.Command {
	opts := options.NewEdgeCoreOptions()
	cmd := &cobra.Command{
		Use: "edgecore",
		Long: `Edgecore is the core edge part of KubeEdge, which contains six modules: devicetwin, edged,
edgehub, eventbus, metamanager, and servicebus. DeviceTwin is responsible for storing device status
and syncing device status to the cloud. It also provides query interfaces for applications. Edged is an
agent that runs on edge nodes and manages containerized applications and devices. Edgehub is a web socket
client responsible for interacting with Cloud Service for the edge computing (like Edge Controller as in the KubeEdge
Architecture). This includes syncing cloud-side resource updates to the edge, and reporting
edge-side host and device status changes to the cloud. EventBus is a MQTT client to interact with MQTT
servers (mosquito), offering publish and subscribe capabilities to other components. MetaManager
is the message processor between edged and edgehub. It is also responsible for storing/retrieving metadata
to/from a lightweight database (SQLite).ServiceBus is a HTTP client to interact with HTTP servers (REST),
offering HTTP client capabilities to components of cloud to reach HTTP servers running at edge. `,
		Run: func(cmd *cobra.Command, args []string) {
			verflag.PrintAndExitIfRequested()
			flag.PrintFlags(cmd.Flags())
			// To help debugging, immediately log version
			klog.Infof("Version: %+v", version.Get())
			registerModules()
			// start all modules
			core.Run()
		},
	}
	fs := cmd.Flags()
	namedFs := opts.Flags()
	verflag.AddFlags(namedFs.FlagSet("global"))
	globalflag.AddGlobalFlags(namedFs.FlagSet("global"), cmd.Name())
	for _, f := range namedFs.FlagSets {
		fs.AddFlagSet(f)
	}
	usageFmt := "Usage:\n %s\n"
	cols, _, _ := term.TerminalSize(cmd.OutOrStdout())
	cmd.SetUsageFunc(func(cmd *cobra.Command) error {
		fmt.Fprintf(cmd.OutOrStderr(), usageFmt, cmd.UseLine())
		cliflag.PrintSections(cmd.OutOrStderr(), namedFs, cols)
		return nil
	})
	cmd.SetHelpFunc(func(cmd *cobra.Command, args []string) {
		fmt.Fprintf(cmd.OutOrStdout(), "%s\n\n"+usageFmt, cmd.Long, cmd.UseLine())
		cliflag.PrintSections(cmd.OutOrStdout(), namedFs, cols)
	})
	return cmd
}
核心代码:
opts := options.NewEdgeCoreOptions()
registerModules()
core.Run()
3. registerModules
edgecore仍然采用Beehive通信框架,模块调用前需先注册对应的模块。注册流程可参考cloudcore中registerModules处的分析,此处不再展开,区别仅在于这里注册的是edgecore所涉及的组件。
edge/cmd/edgecore/app/server.go
// registerModules register all the modules started in edgecore
func registerModules() {
	devicetwin.Register()
	edged.Register()
	edgehub.Register()
	eventbus.Register()
	edgemesh.Register()
	metamanager.Register()
	servicebus.Register()
	test.Register()
	dbm.InitDBManager()
}
4. core.Run
core.Run与cloudcore中的逻辑一致,此处不再展开分析。
vendor/github.com/kubeedge/beehive/pkg/core/core.go
//Run starts the modules and in the end does module cleanup
func Run() {
	//Address the module registration and start the core
	StartModules()
	// monitor system signal and shutdown gracefully
	GracefulShutdown()
}
15.2 - OpenYurt
15.2.1 - OpenYurt部署
本文主要介绍部署openyurt组件到k8s集群中。
1. 给云端节点和边缘节点打标签
openyurt将k8s节点分为云端节点和边缘节点:云端节点主要运行云端业务,边缘节点运行边缘业务。当节点与 apiserver 断开连接时,只有运行在开启自治的边缘节点上的Pod才不会被驱逐。节点类型通过 openyurt.io/is-edge-worker 标签来区分,false表示云端节点,true表示边缘节点。
云端组件:
- yurt-controller-manager
- yurt-tunnel-server

边缘组件:
- yurt-hub
- yurt-tunnel-agent
1.1. openyurt.io/is-edge-worker节点标签
# 云端节点,值为false
kubectl label node us-west-1.192.168.0.87 openyurt.io/is-edge-worker=false
# 边缘节点,值为true
kubectl label node us-west-1.192.168.0.88 openyurt.io/is-edge-worker=true
1.2. 给边缘节点开启自治模式
kubectl annotate node us-west-1.192.168.0.88 node.beta.openyurt.io/autonomy=true
2. 部署 Yurt-controller-manager(cloud)
yurt-controller-manager用来在节点与apiserver失联时,避免自治边缘节点上的pod被驱逐。
wget https://raw.githubusercontent.com/openyurtio/openyurt/master/config/setup/yurt-controller-manager.yaml
kubectl apply -f yurt-controller-manager.yaml
yaml文件位于https://github.com/openyurtio/openyurt/tree/master/config/setup
禁用默认的 nodelifecycle 控制器
nodelifecycle控制器主要根据node的status及lease的更新时间来决定是否驱逐节点上的pod。为了让 yurt-controller-manager 能够正常工作,需要禁用kube-controller-manager中该控制器的驱逐功能。
vim /etc/kubernetes/manifests/kube-controller-manager.yaml
# 在--controllers=*,bootstrapsigner,tokencleaner后面添加,-nodelifecycle
# 即参数为: --controllers=*,bootstrapsigner,tokencleaner,-nodelifecycle
# 如果kube-controller-manager是以static pod部署,修改yaml文件后会自动重启。
3. 部署 Yurthub(edge)
在 yurt-controller-manager 启动并正常运行后,以静态 pod 的方式部署 Yurthub。
- 为 yurthub 创建全局配置(即RBAC, configmap)
wget https://raw.githubusercontent.com/openyurtio/openyurt/master/config/setup/yurthub-cfg.yaml
kubectl apply -f yurthub-cfg.yaml
- 在边缘节点以static pod方式创建yurthub
mkdir -p /etc/kubernetes/manifests/
cd /etc/kubernetes/manifests/
wget https://raw.githubusercontent.com/openyurtio/openyurt/master/config/setup/yurthub.yaml
# 获取bootstrap token
kubeadm token create
# 假设 apiserver 的地址是 1.2.3.4:6443,bootstrap token 是 07401b.f395accd246ae52d
sed -i 's|__kubernetes_master_address__|1.2.3.4:6443|;
s|__bootstrap_token__|07401b.f395accd246ae52d|' /etc/kubernetes/manifests/yurthub.yaml
4. 重置 Kubelet
重置 kubelet 服务,让它通过 yurthub 访问apiserver。为 kubelet 服务创建一个新的 kubeconfig 文件来访问apiserver。
mkdir -p /var/lib/openyurt
cat << EOF > /var/lib/openyurt/kubelet.conf
apiVersion: v1
clusters:
- cluster:
    server: http://127.0.0.1:10261
  name: default-cluster
contexts:
- context:
    cluster: default-cluster
    namespace: default
    user: default-auth
  name: default-context
current-context: default-context
kind: Config
preferences: {}
EOF
修改/etc/systemd/system/kubelet.service.d/10-kubeadm.conf
sed -i "s|KUBELET_KUBECONFIG_ARGS=--bootstrap-kubeconfig=\/etc\/kubernetes\/bootstrap-kubelet.conf\ --kubeconfig=\/etc\/kubernetes\/kubelet.conf|KUBELET_KUBECONFIG_ARGS=--kubeconfig=\/var\/lib\/openyurt\/kubelet.conf|g" \
/etc/systemd/system/kubelet.service.d/10-kubeadm.conf
重启kubelet服务
systemctl daemon-reload && systemctl restart kubelet
5. 部署 Yurt-tunnel (可选)
5.1. 部署云端的 yurt-tunnel-server
wget https://raw.githubusercontent.com/openyurtio/openyurt/master/config/setup/yurt-tunnel-server.yaml
kubectl apply -f yurt-tunnel-server.yaml
5.2. 部署边缘的yurt-tunnel-agent
wget https://raw.githubusercontent.com/openyurtio/openyurt/master/config/setup/yurt-tunnel-agent.yaml
kubectl apply -f yurt-tunnel-agent.yaml
由于yurt-tunnel-server默认使用host网络模式,边缘端的agent可能无法访问云端的tunnel-server,因此需要为tunnel-server配置一个边缘节点可访问的地址。
16 - 虚拟化
16.1 - 虚拟化相关概念
1. 虚拟化
借助虚拟化技术,用户能以单个物理硬件系统为基础创建多个模拟环境或专用资源。一款名为"Hypervisor"(虚拟机监控程序)的软件直接连接到硬件,将一个系统划分为多个相互独立且安全的环境,即虚拟机(VM)。
虚拟化技术可以重新划分IT资源,提高资源的利用率。
2. 虚拟化的类型
全虚拟化(Full virtualization)
全虚拟化使用未修改的guest操作系统版本,guest直接与CPU通信,是最快的虚拟化方法。
半虚拟化(Paravirtualization)
半虚拟化使用修改过的guest操作系统,guest与hypervisor通信,hypervisor将guest的调用传递给CPU和其他接口。因为通信经过hypervisor,因此比全虚拟化慢。
3. hypervisor
hypervisor又称为 virtual machine monitor (VMM),是一个创建和运行虚拟机的程序。被 hypervisor 用来运行一个或多个虚拟机的计算机称为宿主机(host machine),这些虚拟机则称为客户机(guest machine)。
4. kvm
kvm(Kernel-based Virtual Machine)是Linux内核的虚拟化模块,可以利用Linux内核的功能来作为hypervisor。
KVM本身不进行模拟,而是暴露一个/dev/kvm接口。
使用KVM,可以在Linux上运行多个虚拟机,每个虚拟机拥有独立的虚拟化硬件(网卡、磁盘、显卡等),虚拟机中可以运行未经修改的Linux或Windows镜像。
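下面是一个示意性的Go程序(仅适用于Linux,需要对/dev/kvm有访问权限;ioctl编号0xAE00对应KVM_GET_API_VERSION),演示用户态程序如何通过/dev/kvm接口与KVM交互、查询KVM API版本:

package main

import (
	"fmt"
	"os"
	"syscall"
)

// KVM_GET_API_VERSION 对应的ioctl编号,即 _IO(0xAE, 0x00)
const kvmGetAPIVersion = 0xAE00

func main() {
	// KVM通过字符设备/dev/kvm向用户态暴露接口
	f, err := os.OpenFile("/dev/kvm", os.O_RDWR, 0)
	if err != nil {
		fmt.Println("open /dev/kvm:", err)
		return
	}
	defer f.Close()

	// 通过ioctl查询KVM API版本(稳定版本固定为12)
	ver, _, errno := syscall.Syscall(syscall.SYS_IOCTL, f.Fd(), kvmGetAPIVersion, 0)
	if errno != 0 {
		fmt.Println("ioctl KVM_GET_API_VERSION:", errno)
		return
	}
	fmt.Println("KVM API version:", ver)
}

QEMU等用户态程序正是基于这一接口(进一步通过KVM_CREATE_VM等ioctl)来创建和运行虚拟机的。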
5. qemu
QEMU(Quick Emulator)是一款开源的机器模拟器与虚拟化器。QEMU既可以通过动态二进制翻译模拟完整的硬件来运行虚拟机,也可以配合KVM利用硬件虚拟化能力,以接近原生的性能运行客户机。
6. libvirt
libvirt是一套用于管理硬件虚拟化的开源API、守护进程和命令行工具,可用于管理KVM、QEMU等虚拟化技术。
参考:
- https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/virtualization_getting_started_guide/chap-virtualization_getting_started-what_is_it
- https://en.wikipedia.org/wiki/Hypervisor
- https://www.linux-kvm.org/page/Main_Page
- http://www.linux-kvm.org/page/Documents
- https://en.wikipedia.org/wiki/Kernel-based_Virtual_Machine
- https://wiki.qemu.org/Index.html
16.2 - KubeVirt
16.2.1 - KubeVirt的介绍
本文主要由《云原生虚拟化:基于 Kubevirt 构建边缘计算实例》一文重新整理而成。
1. kubevirt简介
kubevirt构建于k8s之上,提供了通过k8s来编排和管理虚拟机的能力。
2. 架构图

2.1. 组件说明
| 分类 | 组件 | 部署方式 | 功能说明 |
|---|---|---|---|
| 控制面 | virt-api | deployment | 自定义API,提供开机、关机、重启等操作,作为apiserver的插件,业务通过k8s apiserver请求virt-api。 |
| | virt-controller | deployment | 管理和监控VMI对象的状态,控制VMI对应的pod。 |
| 节点侧 | virt-handler | daemonset | 类似kubelet,管理宿主机上的所有虚拟机实例。 |
| | virt-launcher | pod(每个VMI对应一个) | 调用libvirt和qemu创建虚拟机进程。 |
virt-launcher与libvirt逻辑:

2.2. 自定义CRD对象
| 分类 | CRD对象 | 功能说明 |
|---|---|---|
| 虚机 | VirtualMachineInstance(VMI) | 代表运行中的虚拟机实例 |
| | VirtualMachine(VM) | 虚机对象,提供开机、关机、重启等操作,管理VMI实例,与VMI的关系是1:1 |
3. 创建虚拟机流程
待补充
参考:
16.2.2 - KubeVirt的使用
1. 安装kubevirt
1.1. 修改镜像仓库
针对私有环境,需要将所需镜像上传到自己的镜像仓库中。
涉及的镜像组件有:
- virt-operator
- virt-api
- virt-controller
- virt-launcher
重命名镜像脚本如下:
#!/bin/bash
# kubevirt组件版本
version=$1
# 私有镜像仓库
registry=$2
# 私有镜像仓库的namespace
namespace=$3
kubevirtRegistry="quay.io/kubevirt"
readonly APPLIST=(
virt-operator
virt-api
virt-controller
virt-launcher
)
for app in "${APPLIST[@]}"; do
# 拉取镜像
docker pull ${kubevirtRegistry}/${app}:${version}
# 重命名
docker tag ${kubevirtRegistry}/${app}:${version} ${registry}/${namespace}/${app}:${version}
# 推送镜像
docker push ${registry}/${namespace}/${app}:${version}
done
echo "重新命名成功"
1.2. 部署virt-operator
通过kubevirt operator安装kubevirt相关组件,选择指定版本,下载kubevirt-operator.yaml和kubevirt-cr.yaml文件,并创建k8s相关对象。
如果是私有镜像仓库,则需要将kubevirt-operator.yaml文件中镜像的名字替换为私有镜像仓库的地址,并提前按步骤1推送所需镜像到私有镜像仓库。
# Pick an upstream version of KubeVirt to install
$ export RELEASE=v0.52.0
# Deploy the KubeVirt operator
$ kubectl apply -f https://github.com/kubevirt/kubevirt/releases/download/${RELEASE}/kubevirt-operator.yaml
# Create the KubeVirt CR (instance deployment request) which triggers the actual installation
$ kubectl apply -f https://github.com/kubevirt/kubevirt/releases/download/${RELEASE}/kubevirt-cr.yaml
# wait until all KubeVirt components are up
$ kubectl -n kubevirt wait kv kubevirt --for condition=Available
1.3. 部署virtctl
virtctl用来启动和关闭虚拟机。
VERSION=$(kubectl get kubevirt.kubevirt.io/kubevirt -n kubevirt -o=jsonpath="{.status.observedKubeVirtVersion}")
ARCH=$(uname -s | tr A-Z a-z)-$(uname -m | sed 's/x86_64/amd64/') || windows-amd64.exe
echo ${ARCH}
curl -L -o virtctl https://github.com/kubevirt/kubevirt/releases/download/${VERSION}/virtctl-${VERSION}-${ARCH}
chmod +x virtctl
sudo install virtctl /usr/local/bin
2. kubevirt部署产物
通过手动部署virt-operator,会自动部署以下组件
| 组件 | 部署方式 | 副本数 |
|---|---|---|
| virt-api | deployment | 2 |
| virt-controller | deployment | 2 |
| virt-handler | daemonset | - |
具体参考:
#kg all -n kubevirt
NAME READY STATUS RESTARTS AGE
pod/virt-api-5fb5cffb7f-hgjjh 1/1 Running 0 23h
pod/virt-api-5fb5cffb7f-jcp7x 1/1 Running 0 23h
pod/virt-controller-844cd4f58c-h8vsx 1/1 Running 0 23h
pod/virt-controller-844cd4f58c-vlxqs 1/1 Running 0 23h
pod/virt-handler-lb5ft 1/1 Running 0 23h
pod/virt-handler-mtr4d 1/1 Running 0 22h
pod/virt-handler-sxd2t 1/1 Running 0 23h
pod/virt-operator-8595f577cd-b9txg 1/1 Running 0 23h
pod/virt-operator-8595f577cd-p2f69 1/1 Running 0 23h
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/kubevirt-operator-webhook ClusterIP 10.254.159.81 <none> 443/TCP 23h
service/kubevirt-prometheus-metrics ClusterIP 10.254.7.231 <none> 443/TCP 23h
service/virt-api ClusterIP 10.254.244.139 <none> 443/TCP 23h
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
daemonset.apps/virt-handler 3 3 3 3 3 kubernetes.io/os=linux 23h
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/virt-api 2/2 2 2 23h
deployment.apps/virt-controller 2/2 2 2 23h
deployment.apps/virt-operator 2/2 2 2 23h
NAME DESIRED CURRENT READY AGE
replicaset.apps/virt-api-5fb5cffb7f 2 2 2 23h
replicaset.apps/virt-controller-844cd4f58c 2 2 2 23h
replicaset.apps/virt-operator-8595f577cd 2 2 2 23h
NAME AGE PHASE
kubevirt.kubevirt.io/kubevirt 23h Deployed
3. 创建虚拟机
通过vm.yaml创建虚拟机
# 下载vm.yaml
wget https://kubevirt.io/labs/manifests/vm.yaml
# 创建虚拟机
kubectl apply -f https://kubevirt.io/labs/manifests/vm.yaml
vm.yaml文件
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: testvm
spec:
  running: false
  template:
    metadata:
      labels:
        kubevirt.io/size: small
        kubevirt.io/domain: testvm
    spec:
      domain:
        devices:
          disks:
          - name: containerdisk
            disk:
              bus: virtio
          - name: cloudinitdisk
            disk:
              bus: virtio
          interfaces:
          - name: default
            masquerade: {}
        resources:
          requests:
            memory: 64M
      networks:
      - name: default
        pod: {}
      volumes:
      - name: containerdisk
        containerDisk:
          image: quay.io/kubevirt/cirros-container-disk-demo
      - name: cloudinitdisk
        cloudInitNoCloud:
          userDataBase64: SGkuXG4=
查看虚拟机
kubectl get vms
kubectl get vms -o yaml testvm
启动、停止或进入虚拟机
# 启动虚拟机
virtctl start testvm
# 关闭虚拟机
virtctl stop testvm
# 进入虚拟机
virtctl console testvm
删除虚拟机
kubectl delete vm testvm
参考:
17 - 监控体系
17.1 - Kubernetes集群监控
1. 概述
1.1. cAdvisor
cAdvisor对Node机器上的资源及容器进行实时监控和性能数据采集,包括CPU使用情况、内存使用情况、网络吞吐量及文件系统使用情况。cAdvisor集成在Kubelet中,当kubelet启动时会自动启动cAdvisor,即一个cAdvisor仅对一台Node机器进行监控。kubelet的启动参数--cadvisor-port可以定义cAdvisor对外提供服务的端口,默认为4194,可以通过浏览器访问Node_IP:port。项目主页:http://github.com/google/cadvisor。
1.2. Heapster
Heapster对集群中各个Node、Pod的资源使用数据进行采集:通过访问每个Node上Kubelet的API,再由Kubelet调用cAdvisor的API来采集该节点上所有容器的性能数据。数据由Heapster进行汇聚后保存到后端存储系统中,例如InfluxDB、Google Cloud Logging等。项目主页:https://github.com/kubernetes/heapster。
1.3. InfluxDB
InfluxDB是分布式时序数据库(每条记录都带有时间戳属性),主要用于实时数据采集、事件跟踪记录、存储时间图表、原始数据等,并提供REST API用于数据的存储和查询。项目主页:http://influxdb.com。
1.4. Grafana
Grafana通过Dashboard将InfluxDB中的时序数据展现成图表形式,便于查看集群运行状态。项目主页:http://grafana.org。
1.5. 总体架构图

在当前的Kubernetes部署方式中,Heapster、InfluxDB、Grafana均以Pod的形式启动和运行,Heapster与Master需配置安全连接。
2. 部署与使用
2.1. cAdvisor
kubelet的启动参数--cadvisor-port可以定义cAdvisor对外提供服务的端口,默认为4194。可以通过浏览器Node_IP:port访问。也提供了REST API供客户端远程调用,API返回的格式为JSON,可以采用URL访问:http://hostname:port/api/version/request/
例如:http://14.152.49.100:4194/api/v1.3/machine 获取主机信息。
2.2. Service
2.2.1. heapster-service
heapster-service.yaml
apiVersion: v1
kind: Service
metadata:
  labels:
    kubernetes.io/cluster-service: "true"
    kubernetes.io/name: Heapster
  name: heapster
  namespace: kube-system
spec:
  ports:
  - port: 80
    targetPort: 8082
  selector:
    k8s-app: heapster
2.2.2. influxdb-service
influxdb-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: monitoring-influxdb
  namespace: kube-system
spec:
  type: NodePort
  ports:
  - name: http
    port: 80
    targetPort: 8083
    nodePort: 30083
  - name: api
    port: 8086
    targetPort: 8086
  selector:
    k8s-app: influxGrafana
2.2.3. grafana-service
grafana-service.yaml
apiVersion: v1
kind: Service
metadata:
  labels:
    kubernetes.io/cluster-service: "true"
    kubernetes.io/name: monitoring-grafana
  name: monitoring-grafana
  namespace: kube-system
spec:
  type: NodePort
  ports:
  - port: 80
    targetPort: 8080
    nodePort: 30080
  selector:
    k8s-app: influxGrafana
使用type=NodePort将InfluxDB和Grafana暴露在Node的端口上,以便通过浏览器进行访问。
2.2.4. 创建service
kubectl create -f heapster-service.yaml
kubectl create -f influxdb-service.yaml
kubectl create -f grafana-service.yaml
2.3. ReplicationController
2.3.1. influxdb-grafana-controller
influxdb-grafana-controller-v3.yaml
apiVersion: v1
kind: ReplicationController
metadata:
  name: monitoring-influxdb-grafana-v3
  namespace: kube-system
  labels:
    k8s-app: influxGrafana
    version: v3
    kubernetes.io/cluster-service: "true"
spec:
  replicas: 1
  selector:
    k8s-app: influxGrafana
    version: v3
  template:
    metadata:
      labels:
        k8s-app: influxGrafana
        version: v3
        kubernetes.io/cluster-service: "true"
    spec:
      containers:
      - image: gcr.io/google_containers/heapster_influxdb:v0.5
        name: influxdb
        resources:
          limits:
            cpu: 100m
            memory: 500Mi
          requests:
            cpu: 100m
            memory: 500Mi
        ports:
        - containerPort: 8083
        - containerPort: 8086
        volumeMounts:
        - name: influxdb-persistent-storage
          mountPath: /data
      - image: gcr.io/google_containers/heapster_grafana:v2.6.0-2
        name: grafana
        resources:
          limits:
            cpu: 100m
            memory: 100Mi
          requests:
            cpu: 100m
            memory: 100Mi
        env:
        - name: INFLUXDB_SERVICE_URL
          value: http://monitoring-influxdb:8086
        - name: GF_AUTH_BASIC_ENABLED
          value: "false"
        - name: GF_AUTH_ANONYMOUS_ENABLED
          value: "true"
        - name: GF_AUTH_ANONYMOUS_ORG_ROLE
          value: Admin
        - name: GF_SERVER_ROOT_URL
          value: /api/v1/proxy/namespaces/kube-system/services/monitoring-grafana/
        volumeMounts:
        - name: grafana-persistent-storage
          mountPath: /var
      volumes:
      - name: influxdb-persistent-storage
        emptyDir: {}
      - name: grafana-persistent-storage
        emptyDir: {}
2.3.2. heapster-controller
heapster-controller.yaml
apiVersion: v1
kind: ReplicationController
metadata:
  labels:
    k8s-app: heapster
    name: heapster
    version: v6
  name: heapster
  namespace: kube-system
spec:
  replicas: 1
  selector:
    k8s-app: heapster
    version: v6
  template:
    metadata:
      labels:
        k8s-app: heapster
        version: v6
    spec:
      containers:
      - image: gcr.io/google_containers/heapster:v0.17.0
        name: heapster
        command:
        - /heapster
        - --source=kubernetes:http://192.168.1.128:8080?inClusterConfig=false&kubeletHttps=true&useServiceAccount=true&auth=
        - --sink=influxdb:http://monitoring-influxdb:8086
Heapster设置启动参数说明:
1、--source
配置监控来源。本例中表示从k8s-Master获取各个Node的信息,在URL的参数部分,可按需修改kubeletHttps、inClusterConfig、useServiceAccount的值。
2、--sink
配置后端的存储系统。本例中使用InfluxDB,URL中的主机名是InfluxDB的Service名称,需要DNS服务正常工作;如果没有配置DNS服务,可使用Service的ClusterIP地址。
2.3.3. 创建ReplicationController
kubectl create -f influxdb-grafana-controller-v3.yaml
kubectl create -f heapster-controller.yaml
3. 查看界面及数据
3.1. InfluxDB
访问任意一台Node机器的30083端口。
3.2. Grafana
访问任意一台Node机器的30080端口。
4. 容器化部署
4.1. 拉取镜像
docker pull influxdb:latest
docker pull cadvisor:latest
docker pull grafana:latest
docker pull heapster:latest
4.2. 运行容器
4.2.1. influxdb
#influxdb
docker run -d -p 8083:8083 -p 8086:8086 --expose 8090 --expose 8099 --volume=/opt/data/influxdb:/data --name influxsrv influxdb:latest
4.2.2. cadvisor
#cadvisor
docker run --volume=/:/rootfs:ro --volume=/var/run:/var/run:rw --volume=/sys:/sys:ro --volume=/var/lib/docker/:/var/lib/docker:ro --publish=8080:8080 --detach=true --link influxsrv:influxsrv --name=cadvisor cadvisor:latest -storage_driver=influxdb -storage_driver_db=cadvisor -storage_driver_host=influxsrv:8086
4.2.3. grafana
#grafana
docker run -d -p 3000:3000 -e INFLUXDB_HOST=influxsrv -e INFLUXDB_PORT=8086 -e INFLUXDB_NAME=cadvisor -e INFLUXDB_USER=root -e INFLUXDB_PASS=root --link influxsrv:influxsrv --name grafana grafana:latest
4.2.4. heapster
docker run -d -p 8082:8082 --net=host heapster:canary --source=kubernetes:http://<k8s-server-ip>:8080?inClusterConfig=false\&useServiceAccount=false --sink=influxdb:http://<influxdb-ip>:8086
4.3. 访问
在浏览器输入IP:PORT
17.2 - cAdvisor介绍
1. cAdvisor简介
cAdvisor对Node机器上的资源及容器进行实时监控和性能数据采集,包括CPU使用情况、内存使用情况、网络吞吐量及文件系统使用情况。cAdvisor集成在Kubelet中,当kubelet启动时会自动启动cAdvisor,即一个cAdvisor仅对一台Node机器进行监控。kubelet的启动参数--cadvisor-port可以定义cAdvisor对外提供服务的端口,默认为4194,可以通过浏览器访问<Node_IP:port>。项目主页:http://github.com/google/cadvisor。
2. cAdvisor结构图

3. Metrics
| 分类 | 字段 | 描述 |
|---|---|---|
| cpu | cpu_usage_total | |
| | cpu_usage_system | |
| | cpu_usage_user | |
| | cpu_usage_per_cpu | |
| | load_average | Smoothed average of number of runnable threads x 1000 |
| memory | memory_usage | Memory Usage |
| | memory_working_set | Working set size |
| network | rx_bytes | Cumulative count of bytes received |
| | rx_errors | Cumulative count of receive errors encountered |
| | tx_bytes | Cumulative count of bytes transmitted |
| | tx_errors | Cumulative count of transmit errors encountered |
| filesystem | fs_device | Filesystem device |
| | fs_limit | Filesystem limit |
| | fs_usage | Filesystem usage |
4. cAdvisor源码
4.1. cAdvisor入口函数
cadvisor.go
func main() {
	defer glog.Flush()
	flag.Parse()
	if *versionFlag {
		fmt.Printf("cAdvisor version %s (%s)\n", version.Info["version"], version.Info["revision"])
		os.Exit(0)
	}
	setMaxProcs()
	memoryStorage, err := NewMemoryStorage()
	if err != nil {
		glog.Fatalf("Failed to initialize storage driver: %s", err)
	}
	sysFs, err := sysfs.NewRealSysFs()
	if err != nil {
		glog.Fatalf("Failed to create a system interface: %s", err)
	}
	collectorHttpClient := createCollectorHttpClient(*collectorCert, *collectorKey)
	containerManager, err := manager.New(memoryStorage, sysFs, *maxHousekeepingInterval, *allowDynamicHousekeeping, ignoreMetrics.MetricSet, &collectorHttpClient)
	if err != nil {
		glog.Fatalf("Failed to create a Container Manager: %s", err)
	}
	mux := http.NewServeMux()
	if *enableProfiling {
		mux.HandleFunc("/debug/pprof/", pprof.Index)
		mux.HandleFunc("/debug/pprof/cmdline", pprof.Cmdline)
		mux.HandleFunc("/debug/pprof/profile", pprof.Profile)
		mux.HandleFunc("/debug/pprof/symbol", pprof.Symbol)
	}
	// Register all HTTP handlers.
	err = cadvisorhttp.RegisterHandlers(mux, containerManager, *httpAuthFile, *httpAuthRealm, *httpDigestFile, *httpDigestRealm)
	if err != nil {
		glog.Fatalf("Failed to register HTTP handlers: %v", err)
	}
	cadvisorhttp.RegisterPrometheusHandler(mux, containerManager, *prometheusEndpoint, nil)
	// Start the manager.
	if err := containerManager.Start(); err != nil {
		glog.Fatalf("Failed to start container manager: %v", err)
	}
	// Install signal handler.
	installSignalHandler(containerManager)
	glog.Infof("Starting cAdvisor version: %s-%s on port %d", version.Info["version"], version.Info["revision"], *argPort)
	addr := fmt.Sprintf("%s:%d", *argIp, *argPort)
	glog.Fatal(http.ListenAndServe(addr, mux))
}
核心代码:
memoryStorage, err := NewMemoryStorage()
sysFs, err := sysfs.NewRealSysFs()
// 创建containerManager
containerManager, err := manager.New(memoryStorage, sysFs, *maxHousekeepingInterval, *allowDynamicHousekeeping, ignoreMetrics.MetricSet, &collectorHttpClient)
// 启动containerManager
err := containerManager.Start()
4.2. cAdvisor Client的使用
import "github.com/google/cadvisor/client"

func main() {
	c, err := client.NewClient("http://192.168.19.30:4194/") // http://<host-ip>:<port>/
	if err != nil {
		panic(err)
	}
	_ = c
}
4.2.1 client定义
cadvisor/client/client.go
// Client represents the base URL for a cAdvisor client.
type Client struct {
	baseUrl string
}

// NewClient returns a new v1.3 client with the specified base URL.
func NewClient(url string) (*Client, error) {
	if !strings.HasSuffix(url, "/") {
		url += "/"
	}
	return &Client{
		baseUrl: fmt.Sprintf("%sapi/v1.3/", url),
	}, nil
}
4.2.2. client方法
1)MachineInfo
// MachineInfo returns the JSON machine information for this client.
// A non-nil error result indicates a problem with obtaining
// the JSON machine information data.
func (self *Client) MachineInfo() (minfo *v1.MachineInfo, err error) {
	u := self.machineInfoUrl()
	ret := new(v1.MachineInfo)
	if err = self.httpGetJsonData(ret, nil, u, "machine info"); err != nil {
		return
	}
	minfo = ret
	return
}
2)ContainerInfo
// ContainerInfo returns the JSON container information for the specified
// container and request.
func (self *Client) ContainerInfo(name string, query *v1.ContainerInfoRequest) (cinfo *v1.ContainerInfo, err error) {
	u := self.containerInfoUrl(name)
	ret := new(v1.ContainerInfo)
	if err = self.httpGetJsonData(ret, query, u, fmt.Sprintf("container info for %q", name)); err != nil {
		return
	}
	cinfo = ret
	return
}
3)DockerContainer
// Returns the JSON container information for the specified
// Docker container and request.
func (self *Client) DockerContainer(name string, query *v1.ContainerInfoRequest) (cinfo v1.ContainerInfo, err error) {
	u := self.dockerInfoUrl(name)
	ret := make(map[string]v1.ContainerInfo)
	if err = self.httpGetJsonData(&ret, query, u, fmt.Sprintf("Docker container info for %q", name)); err != nil {
		return
	}
	if len(ret) != 1 {
		err = fmt.Errorf("expected to only receive 1 Docker container: %+v", ret)
		return
	}
	for _, cont := range ret {
		cinfo = cont
	}
	return
}
4)AllDockerContainers
// Returns the JSON container information for all Docker containers.
func (self *Client) AllDockerContainers(query *v1.ContainerInfoRequest) (cinfo []v1.ContainerInfo, err error) {
	u := self.dockerInfoUrl("/")
	ret := make(map[string]v1.ContainerInfo)
	if err = self.httpGetJsonData(&ret, query, u, "all Docker containers info"); err != nil {
		return
	}
	cinfo = make([]v1.ContainerInfo, 0, len(ret))
	for _, cont := range ret {
		cinfo = append(cinfo, cont)
	}
	return
}
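结合上述客户端方法,下面给出一个假定的完整使用示例(cAdvisor地址192.168.19.30:4194为示例值,结构体字段基于github.com/google/cadvisor/info/v1包,不同版本可能略有差异),查询主机信息及所有Docker容器信息:

package main

import (
	"fmt"
	"log"

	"github.com/google/cadvisor/client"
	v1 "github.com/google/cadvisor/info/v1"
)

func main() {
	c, err := client.NewClient("http://192.168.19.30:4194/")
	if err != nil {
		log.Fatalf("create cadvisor client: %v", err)
	}

	// 获取主机信息
	minfo, err := c.MachineInfo()
	if err != nil {
		log.Fatalf("get machine info: %v", err)
	}
	fmt.Printf("cpu cores: %d, memory: %d bytes\n", minfo.NumCores, minfo.MemoryCapacity)

	// 获取所有Docker容器最近10条性能数据
	query := &v1.ContainerInfoRequest{NumStats: 10}
	containers, err := c.AllDockerContainers(query)
	if err != nil {
		log.Fatalf("get docker containers: %v", err)
	}
	for _, cont := range containers {
		fmt.Println(cont.Name)
	}
}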
17.3 - Heapster介绍
1. heapster简介
Heapster是容器集群监控和性能分析工具,原生支持Kubernetes和CoreOS。Kubernetes有一个著名的监控agent:cAdvisor,它运行在每个kubernetes Node上,收集本机以及容器的监控数据(cpu、memory、filesystem、network、uptime)。
2. heapster部署与配置
2.1. 注意事项
需同步部署机器和被采集机器的时间:ntpdate time.windows.com
加入定时任务,定期同步时间:
crontab -e
30 5 * * * /usr/sbin/ntpdate time.windows.com  # 每天早晨5点半执行
2.2. 容器式部署
#拉取镜像
docker pull heapster:latest
#运行容器
docker run -d -p 8082:8082 --net=host heapster:latest --source=kubernetes:http://<k8s-server-ip>:8080?inClusterConfig=false\&useServiceAccount=false --sink=influxdb:http://<influxdb-ip>:8086?db=<k8s_env_zone>
2.3. 配置说明
可以参考官方文档
2.3.1. --source
--source: 指定数据获取源,这里指定kube-apiserver即可。后缀参数:
- inClusterConfig: 是否使用集群内的配置(默认:true)
- kubeletPort: 指定kubelet的使用端口,默认10255
- kubeletHttps: 是否使用https去连接kubelets(默认:false)
- apiVersion: 指定K8S的apiversion
- insecure: 是否使用安全证书(默认:false)
- auth: 安全认证
- useServiceAccount: 是否使用K8S的安全令牌
2.3.2. --sink
--sink: 指定后端数据存储,这里指定influxdb数据库。后缀参数:
- user: InfluxDB用户
- pw: InfluxDB密码
- db: 数据库名
- secure: 安全连接到InfluxDB(默认:false)
- withfields: 使用InfluxDB fields(默认:false)
3. Metrics
| 分类 | Metric Name | Description | 备注 |
|---|---|---|---|
| cpu | cpu/limit | CPU hard limit in millicores. | CPU上限 |
| | cpu/node_capacity | Cpu capacity of a node. | Node节点的CPU容量 |
| | cpu/node_allocatable | Cpu allocatable of a node. | Node节点可分配的CPU |
| | cpu/node_reservation | Share of cpu that is reserved on the node allocatable. | |
| | cpu/node_utilization | CPU utilization as a share of node allocatable. | |
| | cpu/request | CPU request (the guaranteed amount of resources) in millicores. | |
| | cpu/usage | Cumulative CPU usage on all cores. | CPU总使用量 |
| | cpu/usage_rate | CPU usage on all cores in millicores. | |
| filesystem | filesystem/usage | Total number of bytes consumed on a filesystem. | 文件系统的使用量 |
| | filesystem/limit | The total size of filesystem in bytes. | 文件系统的使用上限 |
| | filesystem/available | The number of available bytes remaining in a the filesystem | 可用的文件系统容量 |
| | filesystem/inodes | The number of available inodes in a the filesystem | |
| | filesystem/inodes_free | The number of free inodes remaining in a the filesystem | |
| memory | memory/limit | Memory hard limit in bytes. | 内存上限 |
| | memory/major_page_faults | Number of major page faults. | |
| | memory/major_page_faults_rate | Number of major page faults per second. | |
| | memory/node_capacity | Memory capacity of a node. | |
| | memory/node_allocatable | Memory allocatable of a node. | |
| | memory/node_reservation | Share of memory that is reserved on the node allocatable. | |
| | memory/node_utilization | Memory utilization as a share of memory allocatable. | |
| | memory/page_faults | Number of page faults. | |
| | memory/page_faults_rate | Number of page faults per second. | |
| | memory/request | Memory request (the guaranteed amount of resources) in bytes. | |
| | memory/usage | Total memory usage. | |
| | memory/cache | Cache memory usage. | |
| | memory/rss | RSS memory usage. | |
| | memory/working_set | Total working set usage. Working set is the memory being used and not easily dropped by the kernel. | |
| network | network/rx | Cumulative number of bytes received over the network. | |
| | network/rx_errors | Cumulative number of errors while receiving over the network. | |
| | network/rx_errors_rate | Number of errors while receiving over the network per second. | |
| | network/rx_rate | Number of bytes received over the network per second. | |
| | network/tx | Cumulative number of bytes sent over the network | |
| | network/tx_errors | Cumulative number of errors while sending over the network | |
| | network/tx_errors_rate | Number of errors while sending over the network | |
| | network/tx_rate | Number of bytes sent over the network per second. | |
| uptime | uptime | Number of milliseconds since the container was started. | |
4. Labels
| Label Name | Description |
|---|---|
| pod_id | Unique ID of a Pod |
| pod_name | User-provided name of a Pod |
| pod_namespace | The namespace of a Pod |
| container_base_image | Base image for the container |
| container_name | User-provided name of the container or full cgroup name for system containers |
| host_id | Cloud-provider specified or user specified Identifier of a node |
| hostname | Hostname where the container ran |
| labels | Comma-separated(Default) list of user-provided labels. Format is 'key:value' |
| namespace_id | UID of the namespace of a Pod |
| resource_id | A unique identifier used to differentiate multiple metrics of the same type. e.x. Fs partitions under filesystem/usage |
5. heapster API
见官方文档:https://github.com/kubernetes/heapster/blob/master/docs/model.md
17.4 - Influxdb介绍
1. InfluxDB简介
InfluxDB是一个当下比较流行的时序数据库,InfluxDB使用 Go 语言编写,无需外部依赖,安装配置非常方便,适合构建大型分布式系统的监控系统。
主要特色功能:
1)基于时间序列,支持与时间有关的相关函数(如最大,最小,求和等)
2)可度量性:你可以实时对大量数据进行计算
3)基于事件:它支持任意的事件数据
2. InfluxDB安装
1)安装
wget https://dl.influxdata.com/influxdb/releases/influxdb-0.13.0.x86_64.rpm
yum localinstall influxdb-0.13.0.x86_64.rpm
2)启动
service influxdb start
3)访问
http://服务器IP:8083
4)docker image方式安装
docker pull influxdb
docker run -d -p 8083:8083 -p 8086:8086 --expose 8090 --expose 8099 --volume=/opt/data/influxdb:/data --name influxsrv influxdb:latest
3. InfluxDB的基本概念
3.1. 与传统数据库中的名词做比较
| influxDB中的名词 | 传统数据库中的概念 |
|---|---|
| database | 数据库 |
| measurement | 数据库中的表 |
| points | 表里面的一行数据 |
3.2. InfluxDB中独有的概念
3.2.1. Point
Point由时间戳(time)、数据(field)、标签(tags)组成。
Point相当于传统数据库里的一行数据,如下表所示:
| Point属性 | 传统数据库中的概念 |
|---|---|
| time | 每个数据记录时间,是数据库中的主索引(会自动生成) |
| fields | 各种记录值(没有索引的属性)也就是记录的值:温度, 湿度 |
| tags | 各种有索引的属性:地区,海拔 |
3.2.2. series
数据库中的数据都需要通过图表来展示,series表示一个measurement(表)里的数据可以在图表上画成几条线,series通过tags的排列组合计算得出。
show series from cpu
4. InfluxDB的基本操作
InfluxDB提供三种操作方式:
1)客户端命令行方式
2)HTTP API接口(示例见下)
3)各语言API库
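针对上面的HTTP API接口方式,下面给出一个示意性的Go示例(假设InfluxDB运行在本机8086端口,数据库mydb已创建):先通过/write接口以行协议(measurement,tag键值 field键值 时间戳)写入一个point,再通过/query接口查询。

package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"strings"
)

func main() {
	// 写入数据:行协议,measurement为cpu_load,tag为hostname,field为value
	line := "cpu_load,hostname=server01 value=0.64"
	resp, err := http.Post("http://127.0.0.1:8086/write?db=mydb", "application/octet-stream", strings.NewReader(line))
	if err != nil {
		panic(err)
	}
	resp.Body.Close()

	// 查询数据:/query接口返回JSON格式的结果
	params := url.Values{}
	params.Set("db", "mydb")
	params.Set("q", "SELECT * FROM cpu_load")
	resp, err = http.Get("http://127.0.0.1:8086/query?" + params.Encode())
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Println(string(body))
}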
4.1. InfluxDB数据库操作
| 操作 | 命令 |
|---|---|
| 显示数据库 | show databases |
| 创建数据库 | create database db_name |
| 删除数据库 | drop database db_name |
| 使用某个数据库 | use db_name |
4.2. InfluxDB数据表操作
| 操作 | 命令 | 说明 |
|---|---|---|
| 显示所有表 | SHOW MEASUREMENTS | |
| 创建数据表 | insert table_name,hostname=server01 value=442221834240i 1435362189575692182 |
其中 disk_free 就是表名,hostname是索引,value=xx是记录值,记录值可以有多个,最后是指定的时间 |
| 删除数据表 | drop measurement table_name |
|
| 查看表内容 | select * from table_name |
|
| 查看series | show series from table_name |
series表示这个表里面的数据,可以在图表上画成几条线,series主要通过tags排列组合算出来 |